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Abstract 

We describe the newly written code GADGET which is suitable both for cosmological simulations of 
structure formation and for the simulation of interacting galaxies. GADGET evolves self-gravitating 
collisionless fluids with the traditional N-body approach, and a collisional gas by smoothed particle 
hydrodynamics. Along with the serial version of the code, we discuss a parallel version that has been 
designed to run on massively parallel supercomputers with distributed memory. While both versions 
use a tree algorithm to compute gravitational forces, the serial version of GADGET can optionally 
employ the special-purpose hardware GRAPE instead of the tree. Periodic boundary conditions are 
supported by means of an Ewald summation technique. The code uses individual and adaptive 
timesteps for all particles, and it combines this with a scheme for dynamic tree updates. Due to its 
Lagrangian nature, GADGET thus allows a very large dynamic range to be bridged, both in space and 
time. So far, GADGET has been successfully used to run simulations with up to 7.5 xlO 7 particles, 
including cosmological studies of large-scale structure formation, high-resolution simulations of the 
formation of clusters of galaxies, as well as workstation-sized problems of interacting galaxies. In 
this study, we detail the numerical algorithms employed, and show various tests of the code. We 
publically release both the serial and the massively parallel version of the code. 
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1. Introduction 

Numerical simulations of three-dimensional self-gravitating 
fluids have become an indispensable tool in cosmology. They 
are now routinely used to study the non-linear gravitational 
clustering of dark matter, the formation of clusters of galax- 
ies, the interactions of isolated galaxies, and the evolution of 
the intergalactic gas. Without numerical techniques the im- 
mense progress made in these fields would have been nearly 
impossible, since analytic calculations are often restricted 
to idealized problems of high symmetry, or to approximate 
treatments of inherently nonlinear problems. 

The advances in numerical simulations have become pos- 
sible both by the rapid growth of computer performance and 
by the implementation of ever more sophisticated numerical 
algorithms. The development of powerful simulation codes 
still remains a primary task if one wants to take full advan- 
tage of new computer technologies. 

Press 
et al. 



useful in collisional stellar dynamical systems, but it is inef- 
ficient for large N due to the rapid increase of its computa- 
tional cost with N. A large number of groups have therefore 
developed N-body codes for collisionless dynamics that com- 
pute the large-scale gravitational field by means of Fourier 
techniques. These are the PM, P 3 M, and AP 3 M codes (East- 
wood & Hockney 1974; |Hohl 1978|; Hockney & Eastwood 
1981; Efstathiou et al. 1985|; |Couchman 199U Bertschinger 



& Gelb 1991; MacFarland et al. 1998). Modern versions 



of these codes supplement the force computation on scales 
below the mesh size with a direct summation, and/or they 
place mesh refinements on highly clustered regions. Pois- 
son's equation can also be solved on a hierarchically refined 
mesh by means of finite-differenc e relaxation methods, an ap- 
proach taken in the ART code by Kravtsov et al. (1997). 



An alternative to these schemes are the so-called tree al- 
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1979, among others) largely employed the direct summation 
method for the gravitational N-body problem, which remains 



gorithms, pioneered by Appel (1981, 1985). Tree algorithms 
arrange particles in a hierarchy of groups, and compute the 
gravitational field at a given point by summing over multi- 
pole expansions of these groups. In this way the computa- 
tional cost of a complete force evaluation can be reduced to 
a 0(N log N) scaling. The grouping itself can be achieved 
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in various ways, for example with Eulerian subd ivisions of 
space ( Barnes & Hut 1986j), or with neare st-neigl lbour pair- 
ings ( Press 1986| ; Jernigan & Porter 1989). A technique re- 
lated to ordinary tree algorithms is the fast multipc ile-method 
(e.g. 3reengard & Rokhlin 1987), where multipale expan- 
sions are carried out for the gravitational field in t region of 
space. 

While mesh-based codes are generally much 
close-to-homogeneous particle distributions, tree 
adapt flexibly to any clustering state without sig 
in speed. This Lagrangian nature is a great 
large dynamic range in density needs to be cove|r 
tree codes can outperform mesh based algorithm 
tion, tree codes are basically free from any g 
strictions, and they can be easily combined with 
schemes that advance particles on individual time; 

Recently, PM and tre e solvers have been co ir 
hybrid Tree-PM codes ( [Xu 1995| ; |Bagla 1999| ; 
2000). In this approach, the speed and accuracy 
method for the long-range part of the gravitation; 
combined with a tree-computation of the short-] 
This may be seen as a replacement of the direct 
PP part in P 3 M codes with a tree algorithm. Th 
technique is clearly a promising new method, e 
large cosmological volumes with strong clusterin 
scales are studied. 

Yet another approach to the N-body problem 
by special-purpose hardware like the GRAPE boaijd 
1 990; [ttoetal. 199l"];|Fukushigeet al. 1991|; Makim > 
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1993; Ebisuzaki et al. 1993; Okumura et al. 1993; 
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sists of custom chips that compute gravitational 
direct summation technique. By means of their 
computational speed they can considerably extend 
where direct summation remains competitive witl 
ware solutions. A recent overview of the family 
boards is given by Hut & Makino (1999 ). The n 
eration of GRAPE technology, the GR APE-6, wil 
peak performance of up to 100 TFlops ( [Vlakino 



ing direct simulations of dense stellar systems w 
numbers approaching 10 6 . Using sophisticated 



GRAPE may also be combined with P ^M ( 3rieu c- 1 al. 1995 ) 



or tree algorithms (Fukushige et al. 1991 
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Athanassoula et al. 1998) to maintain its high computational 
speed even for much larger particle numbers 

In recent years, collisionless dynamics has also 
pled to gas dynamics, allowing a more direct link 
able quantities. Traditionally, hydrodynamical 
have usually employed some kind of mesh to represent 
dynamical quantities of the fluid. While a particul ar 
of these codes is their ability to accurately resolve 
mesh also imposes restrictions on the geometry 
lem, and onto the dynamic range of spatial scales 
simulated. New adaptive mesh refinement codes 
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gical applications, it is often sufficient to de- 
by smoothed particle hydrodynamics (SPH), as 
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'antages have led a number of authors to develop 
'or applications in cosmology. Among them are 
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Klein et al. 1998) have been developed to pro- 



based SPH is extremely flexible in its ability to 
given geometry. Moreover, its Lagrangian nature 
changing resolution that 'automatically' fol- 
mass density. This convenient feature helps to 
ing time by focusing the computational effort on 
s that have the largest gas concentrations. Fur- 
ties naturally into the N-body approach for 
and can be easily implemented in three dimen- 



plper we describe our simulation code GADGET 
with Dark matter and Gas intEracT), which can 
both for studies of isolated self-gravitating systems 
s, or for cosmological N-body/SPH simulations, 
de ^eloped two versions of this code, a serial work- 
on, and a version for massively parallel super- 
\Kdth distributed memory. The workstation code 
tree algorithm for the self-gravity, or the special- 
har I ware GRAPE, if available. The parallel version 
i tree only. Note that in principle several GRAPE 
connected to a separate host computer, can be 
work as a large parallel machine, but this possi- 
implemented in the parallel code yet. While the 
1 irgely follows known algorithmic techniques, we 
novel parallelization strategy in the parallel version. 

emphasis of our work has been on the use of a 
ration scheme with individual and adaptive particle 
and on the elimination of sources of overhead both 
and parallel code under conditions of large dy- 
in timestep. Such conditions occur in dissipative 
al simulations of galaxy formation, but also in 
simulations of cold dark matter. The code al- 
ge of different timestep criteria and cell-opening 
it can be comfortably applied to a wide range 
including cosmological simulations (with or 
ic boundaries), simulations of isolated or inter- 
ies, and studies of the intergalactic medium, 
hink that GADGET is a very flexible code that 
intrinsic restrictions for the dynamic range of 
that can be addressed with it. In this methods- 
describe the algorithmic choices made in GADGET 
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which we release in its parallel and serial versions on the in- 
ternet]], hoping that it will be useful for people working on 
cosmological simulations, and that it will stimulate code de- 
velopment efforts and further code-sharing in the community. 

This paper is structured as follows. In Section ||, we give 
a brief summary of the implemented physics. In Section [| 
we discuss the computation of the gravitational force both 
with a tree algorithm, and with GRAPE. We then describe our 
specific implementation of SPH in Section ^, and we discuss 
our time integration scheme in Section |[ The parallelization 
of the code is described in Section [| and tests of the code are 
presented in Section (7]. Finally, we summarize in Section ||. 

2. Implemented physics 

2.1 . Collisionless dynamics and gravity 

Dark matter and stars are modeled as self-gravitating col- 
lisionless fluids, i.e. they fulfill the collisionless Boltzmann 
equation (CBE) 



^ = ^ + v £^_^£fV =0 

dt dt <9x dr dv 



(1) 



where the self-consistent potential $ is the solution of Pois- 
son's equation 



V 2 $(r,i) =4ttG J /(r,v,i)dv, 



(2) 



and /(r, v, t) is the mass density in single-particle phase- 
space. It is very difficult to solve this coupled system of equa- 
tions directly with finite difference methods. Instead, we will 
follow the common N-body approach, where the phase fluid 
is represented by N particles which are integrated along the 
characteristic curves of the CBE. In essence, this is a Monte 
Carlo approach whose accuracy depends crucially on a suffi- 
ciently high number of particles. 

The N-body problem is thus the task of following Newton's 
equations of motion for a large number of particles under 
their own self-gravity. Note that we will introduce a soften- 
ing into the gravitational potential at small separations. This 
is necessary to suppress large-angle scattering in two-body 
collisions and effectively introduces a lower spatial resolution 
cut-off. For a given softening length, it is important to choose 
the particle number large enough such that relaxation effects 
due to two-body encounters are suppressed sufficiently, oth- 
erwise the N-body system provides no faithful model for a 
collisionless system. Note that the optimum choice of soft- 
ening length as a function of particle density is an issue that 
is still actively discussed in the literature (e.g . Splinter et al. 
1998; |Romeo 1998| ; |Athanassoula et al. 2000| ). 



'GADGET'S web-site is: 



2.2. Gasdynamics 

A simple description of the intergalactic medium (IGM), or 
the interstellar medium (ISM), may be obtained by modeling 
it as an ideal, inviscid gas. The gas is then governed by the 
continuity equation 



dp 
dt 



pV ■ v = 0, 



and the Euler equation 

dv _ VP 



(3) 



(4) 



Further, the thermal energy u per unit mass evolves according 
to the first law of thermodynamics, viz. 



du P v A(u,p) 
dt p p 

Here we used Lagrangian time derivatives, i.e. 

d d 



dt dt 



v V, 



(5) 



(6) 



and we allowed for a piece of 'extra' physics in form of the 
cooling function A(it, p), describing external sinks or sources 
of heat for the gas. 

For a simple ideal gas, the equation of state is 



P = (7 - l)pu, 



(7) 



where 7 is the adiabatic exponent. We usually take 7 = 5/3, 
appropriate for a mono-atomic ideal gas. The adiabatic 
sound speed c of this gas is c 2 = jP/p. 



3. Gravitational forces 
3.1. Tree algorithm 

An alternative to Fourier techniques, or to direct summation, 
are the so-called tree methods. In these schemes, the particles 
are arranged in a hierarchy of groups. When the force on a 
particular particle is computed, the force exerted by distant 
groups is approximated by their lowest multipole moments. 
In this way, the computational cost for a complete force eval- 
uation can be reduced to order 0(N log N) (Appel 1985). 
The forces become more accurate if the multipole expansion 
is carried out to higher order, but eventually the increasing 
cost of evaluating higher moments makes it more efficient 
to terminate the multipole expansion and rather use a larger 
number of smaller tree nodes to achieve a desired force accu- 
racy (|McMillan & Aarseth 1993|). We will follow the com- 
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mon compromise to terminate the expansion after quadrupole 
moments have been included. 
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Figure 1: Schematic illustration of the Barnes & Hut oct-tree in two dimensions. The particles are first enclosed in a square (root node). This 
square is then iteratfvely subdivided in four squares of half the size, until exactly one particle is left in each final square (leaves of the tree). In 
the resulting tree structure, each square can be progenitor of up to four siblings. Note that empty squares need not to be stored. 



We employ the Barnes & Hut (1986, henceforth BH) tree 
construction in this work. In this scheme, the computational 
domain is hierarchically partitioned into a sequence of cubes, 
where each cube contains eight siblings, each with half the 
side-length of the parent cube. These cubes form the nodes 
of an oct-tree structure. The tree is constructed such that each 
node (cube) contains either exactly one particle, or is pro- 
genitor to further nodes, in which case the node carries the 
monopole and quadrupole moments of all the particles that 
lie inside its cube. A schematic illustration of the BH tree is 
shown in Figure [I]. 

A force computation then proceeds by walking the tree, 
and summing up appropriate force contributions from tree 
nodes. In the standard BH tree walk, the multipole expan- 
sion of a node of size I is used only if 



I 

r > - . 



(8) 



where r is the distance of the point of reference to the center- 
of-mass of the cell and 9 is a prescribed accuracy parameter. 
If a node fulfills the criterion (§), the tree walk along this 
branch can be terminated, otherwise it is 'opened', and the 
walk is continued with all its siblings. For smaller values of 
the opening angle, the forces will in general become more 
accurate, but also more costly to compute. One can try to 
modify the opening criterion (^|) to obtain higher efficiency, 
i.e. higher accuracy at a given length of the interaction list, 



something that we will discuss in more detail in Section 3.3 



A technical difficulty arises when the gravity is softened. 
In regions of high particle density (e.g. centers of dark haloes, 
or cold dense gas knots in dissipative simulations), it can hap- 
pen that nodes fulfill equation (|8]), and simultaneously one 
has r < h, where h is the gravitational softening length. In 
this situation, one formally needs the multipole moments of 
the softened gravitational field. One can work around this sit- 
uation by opening nodes always for r < h, but this can slow 
down the code significantly if regions of very high particle 
density occur. Another solution is to use the proper multipole 



expansion for the softened potential, which we here discuss 
for definiteness. We want to approximate the potential at r 
due to a (distant) bunch of particles with masses rrii and co- 
ordinates x^. We use a spline-softened force law, hence the 
exact potential of the particle group is 



$(r) 



-G^2m k g(\xk -r|) , 
fc 



(9) 



where the function g(r) describes the softened force law. For 
Newtonian gravity we have g(r) = 1 /r, while the spline soft- 
ened gravity with softening length h gives rise to 



9(r) 



1 / r 
-~h W > (k 



(10) 



The function is given in the Appendix. It arises by 

replacing the force due to a point mass m with the force ex- 
erted by the mass distribution p(r) = mW(r; h), where we 
take W(r; h) to be the normalized spline kernel used in the 
SPH formalism. The spline softening has the advantage that 
the force becomes exactly Newtonian for r > h, while some 
other possible force laws, like the Plummer softening, con- 
verge relatively slowly to Newton's law. 

Let s be the center-of-mass, and M the total mass of the 
particles. Further we define y = r— s. The potential may then 
be expanded in a multipole series assuming |y| 3> |xj; — s|. 
Up to quadrupole order, this results in 



$(r) 



G\ Mg(y) 

1 -r 



(ID 



^Q + #(P-Q) 



y v 

Here we have introduced the tensors 

Q = ^m/c(x fc - s)(x fe - s) T = ^m/cXfc x[ -Mss 



(12) 
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and 



P = I^m fc (x fe -s) 2 =I 



^rn fc x 2 - Ms 2 
L k 



(13) 



where I is the unit matrix. Note that for Newtonian gravity, 
equation ([lip reduces to the more familiar form 



$(r) 



-G 



M 

y 



1 T 3Q — P 
2 y y~ 5 y 



(14) 



Finally, the quadrupole approximation of the softened gravi- 
tational field is given by 



f(r) = -V$ = G{M gi (y) y + g 2 (y)Qy 

+\gM (y T Qy) y + \gi{y)?y 



(15) 



Here we introduced the functions gi(y), 92(y), 9z{y), and 
g4(y) as convenient abbreviations. Their definition is given 
in the Appendix. In the Newtonian case, this simplifies to 



f(r)=G - 



M 3Q 15 y T Qy 3P 
y 5 Y 2 y 7 Y 2 y 5 ' 



(16) 



Note that although equation (|T3j> looks rather cumbersome, 
its actual numerical computation is only marginally more 
costly than that of the Newtonian form ( p"6| ) because all fac- 
tors involving g(y) and derivatives thereof can be tabulated 
for later use in repeated force calculations. 

3.2. Tree construction and tree walks 

The tree construction can be done by inserting the particles 
one after the other in the tree. Once the grouping is com- 
pleted, the multipole moments of each node can be recur- 
sively computed from the moments of its daughter nodes 
( [McMillan & Aarseth 1993] ). 

In order to reduce the storage requirements for tree nodes, 
we use single-precision floating point numbers to store node 
properties. The precision of the resulting forces is still fully 
sufficient for collisionless dynamics as long as the node prop- 
erties are calculated accurately enough. In the recursive cal- 
culation, node properties will be computed from nodes that 
are already stored in single precision. When the particle num- 
ber becomes very large (note that more than 10 million par- 
ticles can be used in single objects like clusters these days), 
loss of sufficient precision can then result for certain particle 
distributions. In order to avoid this problem, GADGET op- 
tionally uses an alternative method to compute the node prop- 
erties. In this method, a link-list structure is used to access 
all of the particles represented by each tree node, allowing a 
computation of the node properties in double-precision and 
a storage of the results in single-precision. While this tech- 
nique guarantees that node properties are accurate up to a rel- 
ative error of about 10 -6 , it is also slower than the recursive 



computation, because it requires of order O (Af log J\f) oper- 
ations, while the recursive method is only of order 0(Af). 

The tree-construction can be considered very fast in both 
cases, because the time spent for it is negligible compared to 
a complete force walk for all particles. However, in the in- 
dividual time integration scheme only a small fraction of all 
particles may require a force walk at each given timestep. If 
this fraction drops below ~ 1 per cent, a full reconstruction 
of the tree can take as much time as the force walk itself. For- 
tunately, most of this tree construction time can be eliminated 
by dynamic tree updates ( McMillan &_Aarseth 1993 ), which 
we discuss in more detail in Section The most time con- 
suming routine in the code will then always remain the tree 
walk, and optimizing it can considerably s peed up tree co des. 
Interestingly, in the grouping technique of Barnes (199d| ), the 
speed of the gravitational force computation can be increased 
by performing a common tree-walk for a localized group of 
particles. Even though the average length of the interaction 
list for each particles becomes larger in this way, this can 
be offset by saving some of the tree-walk overhead, and by 
improved cache utilization. Unfortunately, this advantage is 
not easily kept if individual timesteps are used, where only 
a small fraction of the particles are active, so we do not use 
grouping. 

GADGET allows different gravitational softenings for parti- 
cles of different 'type' . In order to guarantee momentum con- 
servation, this requires a symmetrization of the force when 
particles with different softening lengths interact. We sym- 
metrize the softenings in the form 



h = max(/ij, hj). 



(17) 



However, the usage of different softening lengths leads to 
complications for softened tree nodes, because strictly speak- 
ing, the multipole expansion is only valid if all the particles 
in the node have the same softening. GADGET solves this 
problem by constructing separate trees for each species of 
particles with different softening. As long as these species 
are more or less spatially separated (e.g. dark halo, stellar 
disk, and stellar bulge in simulations of interacting galaxies), 
no severe performance penalty results. However, this is dif- 
ferent if the fluids are spatially well 'mixed'. Here a single 
tree would result in higher performance of the gravity com- 
putation, so it is advisable to choose a single softening in this 
case. Note that for SPH particles we nevertheless always cre- 
ate a separate tree to allow its use for a fast neighbour search, 
as will be discussed below. 

3.3. Cell-opening criterion 

The accuracy of the force resulting from a tree walk depends 
sensitively on the criterion used to decide whether the multi- 
pole approximation for a given node is acceptable, or whether 
the node has to be 'opened' for further refinement. The stan- 
dard BH opening criterion tries to limit the relative error of 
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every particle-node interaction by comparing a rough esti- 
mate of the size of the quadrupole term, ~ Ml 2 /r 4 , with 
the size of the monopole term, ~ M/r 2 . The result is the 
purely geometrical criterion of equation ( |). 

However, as Salmon & Warren (1994) have pointed out, 
the worst-case behaviour of the BH criterion for commonly 
employed opening angles is somewhat worrying. Although 
typically very rare in real astrophysical simulations, the ge- 
ometrical criterion (Q) can then sometimes lead to very large 
force errors. In order to cure this problem, a number of mod- 
ifications of t he cell-opening criterio n have been proposed. 
For example, Dubinski et al. (1996 ) have used the simple 
modification r > 1/6 + 8, where the quantity 5 gives the dis- 
tance of the geometric center of the cell to its center-of-mass. 
This provides protection against pathological cases where the 
center-of-mass lies close to an edge of a cell. 

Such modifications can help to reduce the rate at which 
large force errors occur, but they usually do not help to deal 
with another problem that arises for geometric opening cri- 
teria in the context of cosmological simulations at high red- 
shift. Here, the density field is very close to being homoge- 
neous and the peculiar accelerations are small. For a tree al- 
gorithm this is a surprisingly tough problem, because the tree 
code always has to sum up partial forces from all the mass 
in a simulation. Small net forces at high z then arise in a 
delicate cancellation process between relatively large partial 
forces. If a partial force is indeed much larger than the net 
force, even a small relative error in it is enough to result in a 
large relative error of the net force. For an unclustered par- 
ticle distribution, the BH criterion therefore requires a much 
smaller value of the opening angle than for a clustered one 
in order to achieve a similar level of force accuracy. Also 
note that in a cosmological simulation the absolute sizes of 
forces between a given particle and tree-nodes of a certain 
opening angle can vary by many orders of magnitude. In this 
situation, the purely geometrical BH criterion may end up in- 
vesting a lot of computational effort for the evaluation of all 
partial forces to the same relative accuracy, irrespective of the 
actual size of each partial force and the size of the absolute 
error thus induced. It would be better to invest more compu- 
tational effort in regions that provide most of the force on the 
particle and less in regions whose mass content is unimpor- 
tant for the total force. 



As suggested by Salmon & Warren (1994 ), one may there- 
fore try to devise a cell-opening criterion that limits the abso- 
lute error in every cell-particle interaction. In principle, one 
can use analytic error bounds (Salmon & Warren 1994) to 
obtain a suitable cell-opening criterion, but the evaluation of 
the relevant expressions can consume significant amounts of 
CPU time. 

Our approach to a new opening criterion is less stringent. 
Assume the absolute size of the true total force is already 
known before the tree walk. In the present code, we will use 
the acceleration of the previous timestep as a handy approx- 



imate value for that. We will now require that the estimated 
error of an acceptable multipole approximation is some small 
fraction of this total force. Since we truncate the multipole 
expansion at quadrupole order, the octupole moment will in 
general be the largest term in the neglected part of the se- 
ries, except when the mass distribution in the cubical cell 
is close to being homogeneous. For a homogeneous cube 
the octupole moment vanishes by symmetry (Barnes & Hut 
1989), such that the hexadecapole moment forms the leading 
term. We may very roughly estimate the size of these terms as 
~ M/r 2 (l/r) 3 , or ~ M/r 2 (l/r) 4 , respectively, and take this 
as a rough estimate of the size of the truncation error. We can 
then require that this error should not exceed some fraction a 
of the total force on the particle, where the latter is estimated 
from the previous timestep. Assuming the octupole scaling, 
a tree-node has then to be opened if M I 3 > a|a Q id |r 5 . How- 
ever, we have found that in practice the opening criterion 



M Z 4 > a|a old |r 6 



(18) 



provides still better performance in the sense that it produces 
forces that are more accurate at a given computational ex- 
pense. It is also somewhat cheaper to evaluate during the 
tree-walk, because r 6 is simpler to compute than r 5 , which 
requires the evaluation of a root of the squared node dis- 
tance. The criterion (|l8]) does not suffer from the high-z prob- 
lem discussed above, because the same value of a produces 
a comparable force accuracy, independent of the clustering 
state of the material. However, we still need to compute the 
very first force using the BH criterion. In Section 7.2 we 



will show some quantitative measurements of the relative per- 
formance of the two criteria, and compare it to the optimum 
cell-opening strategy. 

Note that the criterion ( p"8| ) is not completely safe from 
worst-case force errors either. In particular, such errors can 
occur for opening angles so large that the point of force eval- 
uation falls into the node itself. If this happens, no upper 
bound on the force error can be guaranteed (Salmon & War- 
ren 1994). As an option to the code, we therefore combine 
the opening criterion ( |l8| ) with the requirement that the point 
of reference may not lie inside the node itself. We formulate 
this additional constraint in terms of r > 6 max , where b max is 
the maximum distance of the center-of-mass from any point 
in the cell. This additional geometrical constraint provides a 
very conservative control of force errors if this is needed, but 
increases the number of opened cells. 

3.4. Special purpose hardware 

An alternative to software solutions to the N 2 -bottleneck 
of self-gravity is provided by the GRAPE (GRAvity PipE) 
special-purpose hardware. It is designed to solve the grav- 
itational N-body problem in a direct summation approach 
by means of its superior computational speed. The latter is 
achieved with custom chips that compute the gravitational 
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force with a hardwired Plummer force law. The Plummer- 
potential of GRAPE takes the form 



$(r) 



(\ T - Tj \* +<?)'■ 



(19) 



As an example, the GRAPE- 3 A boards installed at the MPA 
in 1998 have 40 N-body integrator chips in total with an ap- 
proximate peak performance of 25 GFlops. Recently, newer 
generations of GRAPE boards have achieved even higher 
computational speeds. In fact, with the GRAPE-4 the 1 TFlop 
barrier was broken (Makino et al. 1997), and even faster 



specia l-purpose mac hines are in preparation (Hut & Makino 
1999; |Makino 2000| ). The most recent generation, GRAPE-6, 
can not only compute accelerations, but also its first and sec- 
ond time derivatives. Together with the capability to perform 
particle predictions, these machines are ideal for high-order 
Hermite integration schemes applied in simulations of colli- 
sional systems like star clusters. However, our present code 
is only adapted to the somewhat older GRAPE-3 (Okumura 
et al. 1993), and the following discussion is limited to it. 

The GRAPE-3A boards are connected to an ordinary work- 
station via a VME or PCI interface. The boards consist of 
memory chips that can hold up to 131072 particle coordi- 
nates, and of integrator chips that can compute the forces ex- 
erted by these particles for 40 positions in parallel. Higher 
particle numbers can be processed by splitting them up in suf- 
ficiently small groups. In addition to the gravitational force, 
the GRAPE board returns the potential, and a list of neigh- 
bours for the 40 positions within search radii hi specified by 
the user. This latter feature makes GRAPE attractive also for 
SPH calculations. 

The parts of our code that use GRAPE h ave benefited from 
the code GRAPESPH by [Steinmetz (1996| ), and are similar to 
it. In short, the usage of GRAPE proceeds as follows. For the 
force computation, the particle coordinates are first loaded 
onto the GRAPE board, then GADGET calls GRAPE repeat- 
edly to compute the force for up to 40 positions in parallel. 
The communication with GRAPE is done by means of a con- 
venient software interface in C. GRAPE can also provide lists 
of nearest neighbours. For SPH-particles, GADGET computes 
the gravitational force and the interaction list in just one call 
of GRAPE. The host computer then still does the rest of the 
work, i.e. it advances the particles, and computes the hydro- 
dynamical forces. 

In practice, there are some technical complications when 
one works with GRAPE-3. In order to achieve high computa- 
tional speed, the GRAPE-3 hardware works internally with 
special fixed-point formats for positions, accelerations and 
masses. This results in a reduced dynamic range compared 
to standard IEEE floating point arithmetic. In particular, one 
needs to specify a minimum length scale d m ; n and a minimum 
mass scale m m - m when working with GRAPE. The spatial dy- 
namic range is then given by d m i n [— 2 18 ; 2 18 ] and the mass 
range is m m i n [l; 64e/d m ; n ] (Steinmetz 1996). 



While the communication time with GRAPE scales propor- 
tional to the particle number iV, the actual force computation 
of GRAPE is still an C(iV 2 )-algorithm, because the GRAPE 
board implements a direct summation approach to the grav- 
itational N-body problem. This implies that for very large 
particle number a tree code running on the workstation alone 
will eventually catch up and outperform the combination of 
workstation and GRAPE. For our current set-up at MPA this 
break-even point is about at 300000 particles. 

However, it is also possible to combine GRAPE with a tree 
algorithm ( Fukushige et al. 1991fc [Makino 199"Ta|; A thanas- 
soula et al. 1998; [Kawai et al. 2000| ), for example by export- 
ing tree nodes instead of particles in an appropriate way. Such 
a combination of tree+GRAPE scales as 0(N logiV) and is 
able to outperform pure software solutions even for large N. 

4. Smoothed particle hydrodynamics 



SPH is a powerful Lagrangian technique to solve hydrody- 
namical problems with an ease that is unmatched by grid 
based fluid solvers (see Monaghan 1992 ,for an excellent re- 
view). In particular, SPH is very well suited for three- 
dimensional astrophysical problems that do not crucially rely 
on accurately resolved shock fronts. 

Unlike other numerical approaches for hydrodynamics, the 
SPH equations do not take a unique form. Instead, many for- 
mally different versions of them can be derived. Furthermore, 
a large variety of recipes for specific implementations of force 
symmetrization, determinations of smoothing lengths, and ar- 
tificial viscosity, have been described. Some of these choices 
are crucial for the accuracy and efficiency of the SPH imple- 
mentation, others are only of minor importance. See the re- 
cent work by Phackeret al. (2000[ ) and Lombardi et al. (1999) 
for a discussion of the relative performance of some of these 
possibilities. Below we give a summary of the specific SPH 
implementation we use. 



4.1. Basic equations 

The computation of the hydrodynamic force and the rate of 
change of internal energy proceeds in two phases. In the first 
phase, new smoothing lengths hi are determined for the ac- 
tive particles (these are the ones that need a force update at the 
current timestep, see below), and for each of them, the neigh- 
bouring particles inside their respective smoothing radii are 
found. The Lagrangian nature of SPH arises when this num- 
ber of neighbours is kept either exactly, or at least roughly, 
constant. This is achieved by varying the smoothing length 
hi of each particle accordingly. The hi thus adjust to the 
local particle density adaptively, leading to a constant mass 
resolution independent of the density of the flow. Nelson & 
Papaloizou (1994) argue that it is actually best to keep the 
number of neighbours exactly constant, resulting in the low- 
est level of noise in SPH estimates of fluid quantities, and in 
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the best conservation of energy. In practice, similarly good 
results are obtained if the fluctuations in neighbour number 
remain very small. In the serial version of GADGET we keep 
the number of neighbours fixed, whereas it is allowed to vary 
in a small band in the parallel code. 

Having found the neighbours, we compute the density of 
the active particles as 



p t = ^TnjWjrijihi), 

3 = 1 



(20) 



with 



n, 



where 



and 



-acij(j,ij + 2u[i 




I P%. 



if Vij ■ Tij 

otherwise, 



< 



fi = 



P'ij 



|(V. 



|(V-v) 4 | + |(Vxv) 4 r 
hij(vi- Vi)(ri-rj) 



(27) 



(28) 



(29) 



where r 



Yj, and we compute a new estimate of 



divergence and vorticity as 

Pi(V ■ v)< = Y^m j iy j - vOV<W(r. 



This form of artificial viscosity is the shear-reduced version 
(Balsara 1995; Steinmetz 1990 of the 'standard' Monaghan 
& Gingold (1983) artificial viscosity. Recent studies (Lom- 



(21) bardi et al. 1999; fThacker et al. 2000|) that test SPH imple 



Pi(y x v)i — rrij (Vj — Vj) x ViW(ri 3 ■; hi). (22) act with a particle j whenever |r. y | < hi or |r y | < h 



Here we employ the gather formulation for adaptive smooth- 
ing ( friernquist & Katz 1989] ). 

For the passive particles, values for density, internal en- 
ergy, and smoothing length are predicted at the current time 
based on the values of the last update of those particles (see 
Section |J). Finally, the pressure of the particles is set to 
Pi = (7- 

In the second phase, the actual forces are computed. Here 
we symmetrize the kernels of gather and scatter formulations 
as in frlernquist & Katz (1989 ). We compute the gasdynami- 
cal accelerations as 



mentations endorse it. 

In equations (^3|) and (|4|), a given SPH particle i will inter- 
Stan- 
dard search techniques can relatively easily find all neigh- 
bours of particle i inside a sphere of radius hi, but mak- 
ing sure that one really finds all interacting pairs in the case 
hj > hi is slightly more tricky. One solution to this problem 
is to simply find all neighbours of i inside hi, and to consider 
the force components 



fy = 111,111 , 



iViWfoife). (30) 



/ VP 
V P 




(23) 



and the change of the internal energy as 



dt 



It,' 




-ViWirij-hi) + -ViWirij-hj] 



(24) 



Instead of symmetrizing the pressure terms with an arithmetic 
mean, the code can also be used with a geometric mean ac- 
cording to 

P^ 2 VM. (25) 

Pi Pj PiPj 

This may be slightly more robust in certain situations (Hern- 
quist & Katz 1989). The artificial viscosity Iljj is taken to 
be 

1 



n, 



(26) 



If we add f, 3 to the force on i, and — £y to the force on j, the 
sum of equation ( ^3| ) is reproduced, and the momentum con- 
servation is manifest. This also holds for the internal energy. 
Unfortunately, this only works if all particles are active. In 
an individual timestep scheme, we therefore need an efficient 
way to find all the neighbours of particle i in the above sense, 
and we discuss our algorithm for doing this below. 

4.2. Neighbour search 

In SPH, a basic task is to find the nearest neighbours of each 
SPH particle to construct its interaction list. Specifically, in 
the implementation we have chosen we need to find all par- 
ticles closer than a search radius hi in order to estimate the 
density, and one needs all particles with jr^ | < max(/ij, hj) 
for the estimation of hydrodynamical forces. Similar to grav- 
ity, the naive solution that checks the distance of all particle 
pairs is an 0(N 2 ) algorithm which slows down prohibitively 
for large particle numbers. Fortunately, there are faster search 
algorithms. 

When the particle distribution is approximately homoge- 
neous, perhaps the fastest algorithms work with a search grid 
that has a cell size somewhat smaller than the search radius. 
The particles are then first coarse-binned onto this search 
grid, and link-lists are established that quickly deliver only 
those particles that lie in a specific cell of the coarse grid. 



8 



GADGET: A code for collisionless and gasdynamical cosmological simulations 



The neighbour search proceeds then by range searching; only 
those mesh cells that have a spatial overlap with the search 
range have to be opened. 

For highly clustered particle distributions and varying 
search ranges hi, the above approach quickly degrades, since 
the mesh chosen for the coarse grid has not the optimum size 
for all particles. A more flexible alternative is to employ a 
geometric search tree. For this purpose, a tree with a struc- 
ture just like the BH oct-tree can be employed, Hernquist & 
Katz (1989) were the first to use the gravity tree for this pur- 
pose. In GADGET we use the same strategy and perform a 
neighbour search by walking the tree. A cell is 'opened' (i.e. 
further followed) if it has a spatial overlap with the rectan- 
gular search range. Testing for such an overlap is faster with 
a rectangular search range than with a spherical one, so we 
inscribe the spherical search region into a little cube for the 
purpose of this walk. If one arrives at a cell with only one 
particle, this is added to the interaction list if it lies inside the 
search radius. We also terminate a tree walk along a branch, 
if the cell lies completely inside the search range. Then all the 
particles in the cell can be added to the interaction list, with- 
out checking any of them for overlap with the search range 
any more. The particles in the cell can be retrieved quickly 
by means of a link-list, which can be constructed along with 
the tree and allows a retrieval of all the particles that lie in- 
side a given cell, just as it is possible in the coarse-binning 
approach. Since this short-cut reduces the length of the tree 
walk and the number of required checks for range overlap, the 
speed of the algorithm is increased by a significant amount. 

With a slight modification of the tree walk, one can also 
find all particles with |r.y| < max(/ij, hj). For this pur- 
pose, we store in each tree node the maximum SPH smooth- 
ing length occurring among its particles. The test for over- 
lap is then simply done between a cube of side-length 
max(/ii, hnode) centered on the particle i and the node itself, 
where ft. n odo is the maximum smoothing length among the 
particles of the node. 

There remains the task to keep the number of neighbours 
around a given SPH particle approximately (or exactly) con- 
stant. We solve this by predicting a value hi for the smoothing 
length based on the length hi of the previous timestep, the ac- 
tual number of neighbours Ni at that timestep, and the local 
velocity divergence: 



(old) 



1+ ^ 



1/3' 



hi At, 



(31) 



where hi = |/ii(V • v)^, and N s is the desired number 
of neighbours. A similar form for updating the smoothing 
lengths has been used by frlernquist & Katz (1989 ), see also 
Thacker et al. (2000) for a discussion of alternative choices. 



The term in brackets tries to bring the number of neighbours 
back to the desired value if Ni deviates from it. Should the re- 
sulting number of neighbours nevertheless fall outside a pre- 



scribed range of tolerance, we iteratively adjust hi until the 
number of neighbours is again brought back to the desired 
range. Optionally, our code allows the user to impose a min- 
imum smoothing length for SPH, typically chosen as some 
fraction of the gravitational softening length. A larger num- 
ber of neighbours than N s is allowed to occur if hi takes on 
this minimum value. 

One may also decide to keep the number of neighbours 
exactly constant by defining hi to be the distance to the N s - 
nearest particle. We employ such a scheme in the serial code. 
Here we carry out a range-search with R = 1.2hi, on aver- 
age resulting in ~ 2N S potential neighbours. From these we 
select the closest N s (fast algorithms for doing this exist, see 
^ress et al. 1995 ). If there are fewer than N s particles in the 
search range, or if the distance of the Af s -nearest particle in- 
side the search range is larger than R, the search is repeated 
for a larger search range. In the first timestep no previous hi 
is known, so we follow the neighbour tree backwards from 
the leaf of the particle under consideration, until we obtain a 
first reasonable guess for the local particle density (based on 
the number N of particles in a node of volume ^ 3 ), providing 
an initial guess for hi. 

However, the above scheme for keeping the number of 
neighbours exactly fixed is not easily accommodated in our 
parallel SPH implementation, because SPH particles may 
have a search radius that overlaps with several processor do- 
mains. In this case, the selection of the closest N s neigh- 
bours becomes non-trivial, because the underlying data is dis- 
tributed across several independent processor elements. For 
parallel SPH, we therefore revert to the simpler scheme and 
allow the number of neighbours to fluctuate within a small 
band. 

5. Time integration 

As a time integrator, we use a variant of the leapfrog in- 
volving an explicit prediction step. The latter is introduced 
to accommodate individual particle timesteps in the N-body 
scheme, as explained later on. 

We start by describing the integrator for a single particle. 
First, a particle position at the middle of the timestep At is 
predicted according to 



At 

T' 



(32) 



and an acceleration based on this position is computed, viz. 

a (»+i) = - V$| f( „ + i, . (33) 
Then the particle is advanced according to 

v (" +1 ) = v ( n) + a^+^At, (34) 



? (n+l) _ r (n) + 



1 



» 



At. 



(35) 
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5.1. Timestep criterion 

In the above scheme, the timestep may vary from step to step. 
It is clear that the choice of timestep size is very important 
in determining the overall accuracy and computational effi- 
ciency of the integration. 

In a static potential $, the error in specific energy arising 
in one step with the above integrator is 



AE 



1 <9 2 $ 



Adxidx- 1 3 



(«)„(«+£) A +3 



At J + 



(36) 



1 



3 



24 dxidxjdxk 



(n) (n) (n) A ,3 

v\ 'vj 'v\ 'At° 



-0(At 4 



to leading order in At, i.e. the integrator is second order ac- 
curate. Here the derivatives of the potential are taken at coor- 
dinate r(") and summation over repeated coordinate indices 
is understood. 

In principle, one could try to use equation ([36]) directly to 
obtain a timestep by imposing some upper limit on the tol- 
erable error AE. However, this approach is quite subtle in 
practice. First, the derivatives of the potential are difficult 
to obtain, and second, there is no explicit guarantee that the 
terms of higher order in At are really small. 

High-order Hermite schemes use timestep criteria that in- 
clude the first and second time derivative of the acceleration 



(e.g. [Making 1991b| ; |Makino & Aarseth 1992| ). While these 
timestep criteria are highly successful for the integration of 
very nonlinear systems, they are probably not appropriate for 
our low-order scheme, apart from the fact that substantial 
computational effort is required to evaluate these quantities 
directly. Ideally, we therefore want to use a timestep crite- 
rion that is only based on dynamical quantities that are either 
already at hand or are relatively cheap to compute. 

Note that a well known problem of adaptive timestep 
schemes is that they will usually break the time reversibil- 
ity and symplectic nature of the simple leapfrog. As a re- 
sult, the system does not evolve under a pseudo-Hamiltonian 
any more and secular drifts in the total energy can occur. As 
Quinn et al. (1997) show, reversibility can be obtained with 
a timestep that depends only on the relative coordinates of 
particles. This is for example the case for timesteps that de- 
pend only on acceleration or on local density. However, to 
achieve reversibility the timestep needs to be chosen based on 
the state of the system in the middle of the timestep (Quinn 
et al. 1997), or on the beginning and end of the timestep (Hut 
et al. 1995). In practice, this can be accomplished by discard- 
ing trial timesteps appropriately. The present code selects the 
timestep based on the previous step and is thus not reversible 
in this way. 

One possible timestep criterion is obtained by constrain- 
ing the absolute size of the second order displacement of the 
kinetic energy, assuming a typical velocity dispersion a 2 for 
the particles, which corresponds to a scale E = a 2 for the 



typical specific energy. This results in 

a 



At = a to 



a 



(37) 



For a collisionless fluid, the velocity scale a should ideally 
be chosen as the local velocity dispersion, leading to smaller 
timesteps in smaller haloes, or more generally, in 'colder' 
parts of the fluid. The local velocity dispersion can be es- 
timated from a local neighbourhood of particles, obtained as 
in the normal SPH formalism. 

Alternatively, one can constrain the second order term in 
the particle displacement, obtaining 



At 



(38) 



Here some length scale a' tol e is introduced, which will typ- 
ically be related to the gravitational softening. This form 
has quite often been employed in cosmological simulations, 
sometimes with an additional restriction on the displacement 
of particles in the form At = de/|v|. It is unclear though 
why the timesteps should depend on the gravitational soft- 
ening length in this way. In a well-resolved halo, most or- 
bits are not expected to change much if the halo is modeled 
with more particles and a correspondingly smaller softening 
length, so it should not be necessary to increase the accuracy 
of the time integration for all particles by the same factor if 
the mass/length resolution is increased. 

For self-gravitating collisionless fluids, another plausible 
timestep criterion is based on the local dynamical time: 



At 



"tol 



(39) 



One advantage of this criterion is that it provides a monoton- 
ically decreasing timestep towards the center of a halo. On 
the other hand, it requires an accurate estimate of the local 
density, which may be difficult to obtain, especially in re- 
gions of low density. In particular, Quinn et al. (1997 ) have 
shown that haloes in cosmological simulations that contain 
only a small number of particles, about equal or less than the 
number employed to estimate the local density, are suscepti- 
ble to destruction if a timestep based on ( J39| ) is used. This is 
because the kernel estimates of the density are too small in 
this situation, leading to excessively long timesteps in these 
haloes. 

In simple test integrations of singular potentials, we have 
found the criterion ([37]) to give better results compared to the 
alternative (f3~8"|). However, neither of these simple criteria is 
free of problems in typical applications to structure forma- 
tion, as we will later show in some test calculations. In the 
center of haloes, subtle secular effects can occur under con- 
ditions of coarse integration settings. The criterion based on 
the dynamical time does better in this respect, but it does not 
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work well in regions of very low density. We thus suggest to 
use a combination of ( j37[ ) and (^9|) by taking the minimum of 
the two timesteps. This provides good integration accuracy 
in low density environments and simultaneously does well in 
the central regions of large haloes. For the relative setting 
of the dimensionless tolerance parameters we use a" ~ 3a, 
which typically results in a situation where roughly the same 
number of particles are constrained by each of the two cri- 
teria in an evolved cosmological simulation. The combined 
criterion is Galilean-invariant and does not make an explicit 
reference to the gravitational softening length employed. 

5.2. Integrator for N-body systems 

In the context of stellar dynamical integrations, individual 
particle timest eps have long been used since they were first 
introduced by Aarseth (1963 ). We here employ an integrator 
with complete ly fle xible timesteps, similar to the one used by 
|Groom (1997b and Hiotelis & Voglis (199l| ). This scheme 
differs slightly from mor e commonly employed b inary hier- 
archies of timest eps (e.g. ^[ernquist & Katz 1989 ; McMillan 
& Aarseth 1993; |Steinmetz 19%^ 

Each particle has a timestep Ati, and a current time tj, 
where its dynamical state (r^, Vi, a*) is stored. The dynamical 
state of the particle can be predicted at times t 6 [t$ ± 0.5 Ati] 
with first order accuracy. 

The next particle k to be advanced is then the one with the 
minimum prediction time defined as r p = min (t{ + 0.5 Atj). 
The time t p becomes the new current time of the system. To 
advance the corresponding particle, we first predict positions 
for all particles at time r p according to 



r, = r 



+ v 4 (r p - U). 



(40) 



Based on these positions, the acceleration of particle k at the 
middle of its timestep is calculated as 



>+|) 



V4(r<) 



a k ■ - \-/lr fc 

Position and velocity of particle k are then advanced as 



(n+l) (n) 
V k = V k 



2a 



Op - tk), 



>+!) 



= r 



(n) 



» 



(n+l) 



Op - tk), 



and its current time can be updated to 



t 



(new) 
k 



t k + 2(T p -t k ). 

, (new) 



(41) 

(42) 
(43) 

(44) 



Finally, a new timestep At; ; for the particle is estimated. 

At the beginning of the simulation, all particles start out 
with the same current time. However, since the timesteps of 
the particles are all different, the current times of the particles 
distribute themselves nearly symmetrically around the cur- 
rent prediction time, hence the prediction step involves for- 
ward and backward prediction to a similar extent. 



Of course, it is impractical to advance only a single particle 
at any given prediction time, because the prediction itself and 
the (dynamic) tree updates induce some overhead. For this 
reason we advance particles in bunches. The particles may 
be thought of as being ordered according to their prediction 
times t? = tj + iAtj. The simulation works through this 
time line, and always advances the particle with the smallest 
t?, and also all subsequent particles in the time line, until the 
first is found with 



Tp < U 



1 . 



(45) 



This condition selects a group of particles at the lower end 
of the time line, and all the particles of the group are guaran- 
teed to be advanced by at least half of their maximum allowed 
timestep. Compared to using a fixed block step scheme with 
a binary hierarchy, particles are on average advanced closer 
to their maximum allowed timestep in this scheme, which re- 
sults in a slight improvement in efficiency. Also, timesteps 
can more gradually vary than in a power of two hierarchy. 
However, a perhaps more important advantage of this scheme 
is that it makes work-load balancing in the parallel code sim- 
pler, as we will discuss in more detail later on. 

In practice, the size M of the group that is advanced at a 
given step is often only a small fraction of the total particle 
number N. In this situation it becomes important to elimi- 
nate any overhead that scales with O(N). For example, we 
obviously need to find the particle with the minimum predic- 
tion time at every timestep, and also the particles following 
it in the time line. A loop over all the particles, or a com- 
plete sort at every timestep, would induce overhead of order 
O(N) or 0(N log N), which can become comparable to the 
force computation itself if M/N <C 1. We solve this problem 
by keeping the maximum prediction times of the particles in 
an ordered binary tree (Wirth 1986) at all times. Finding the 
particle with the minimum prediction time and the ones that 
follow it are then operations of order C(log N). Also, once 
the particles have been advanced, they can be removed and 
reinserted into this tree with a cost of order (log AT). To- 
gether with the dynamic tree updates, which eliminate predic- 
tion and tree construction overhead, the cost of the timestep 
then scales as 0(M log N). 

5.3. Dynamic tree updates 

If the fraction of particles to be advanced at a given timestep 
is indeed small, the prediction of all particles and the recon- 
struction of the full tree would also lead to significant sources 
of overhead. However, as McMillan & Aarseth (1993) have 



first discussed, the geometric structure of the tree, i.e. the way 
the particles are grouped into a hierarchy, evolves only rela- 
tively slowly in time. It is therefore sufficient to reconstruct 
this grouping only every few timesteps, provided one can still 
obtain accurate node properties (center of mass, multipole 
moments) at the current prediction time. 
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We use such a scheme of dynamic tree updates by predict- 
ing properties of tree-nodes on the fly, instead of predicting 
all particles every single timestep. In order to do this, each 
node carries a center-of-mass velocity in addition to its posi- 
tion at the time of its construction. New node positions can 
then be predicted while the tree is walked, and only nodes 
that are actually visited need to be predicted. Note that the 
leaves of the tree point to single particles. If they are used 
in the force computation, their prediction corresponds to the 



ordinary prediction as outlined in equation ( 43 ) 



In our simple scheme we neglect a possible time variation 
of the quadrupole mome nt of the nodes, which can b e taken 
into account in principle ( McMillan & Aarseth 1993 ). How- 
ever, we introduce a mechanism that reacts to fast time vari- 
ations of tree nodes. Whenever the center-of mass of a tree 
node under consideration has moved by more than a small 
fraction of the nodes' side-length since the last reconstruction 
of this part of the tree, the node is completely updated, i.e. the 
center-of-mass, center-of-mass velocity and quadrupole mo- 
ment are recomputed from the individual (predicted) phase- 
space variables of the particles. We also adjust the side-length 
of the tree node if any of its particles should have left its orig- 
inal cubical volume. 

Finally, the full tree is reconstructed from scratch every 
once in a while to take into account the slow changes in the 
grouping hierarchy. Typically, we update the tree whenever a 
total of ~ 0.1 AT force computations have been done since the 
last full reconstruction. With this criterion the tree construc- 
tion is an unimportant fraction of the total computation time. 
We have not noticed any significant loss of force accuracy 
induced by this procedure. 

In summary, the algorithms described above result in an 
integration scheme that can smoothly and efficiently evolve 
an N-body system containing a large dynamic range in time 
scales. At a given timestep, only a small number M of par- 
ticles are then advanced, and the total time required for that 
scales as 0(M log AT). 

5.4. Including SPH 

The above time integration scheme may easily be extended 
to include SPH. Here we also need to integrate the internal 
energy equation, and the particle accelerations also receive a 
hydrodynamical component. To compute the latter we also 
need predicted velocities 

v; = v; + ai_i(r p - ti), (46) 

where we have approximated with the acceleration of the 
previous timestep. Similarly, we obtain predictions for the 
internal energy 

Ui = m +Ui(r p - ti), (47) 
and the density of inactive particles as 

Pi= pi+ pi(r p -ti). (48) 



For those particles that are to be advanced at the current 
system step, these predicted quantities are then used to com- 
pute the hydrodynamical part of the acceleration and the rate 
of change of internal energy with the usual SPH estimates, as 
described in Section 

For hydrodynamical stability, the collisionless timestep 
criterion needs to be supplemented with the Courant condi- 
tion. We adopt it for the gas particles in the form 



At, 



Jn|(V-v),| 



c(ci, |vi|) (1 + 0.6 a visc )' 



(49) 



where a v i sc regulates the strength of the artificial bulk vis- 
cosity, and acour is an accuracy parameter, the Courant fac- 
tor. Note that we use the maximum of the sound speed a 
and the bulk velocity |vf| in this expression. This improves 
the handling of strong shocks when the infalling material is 
cold, but has the disadvantage of not being Galilean invari- 
ant. For the SPH-particles, we use either the adopted criterion 
for collisionless particles or (j49|), whichever gives the smaller 
timestep. 

As defined above, we evaluate a gas and u at the middle of 
the timestep, when the actual timestep At of the particle that 
will be advanced is already set. Note that there is a term in 
the artificial viscosity that can cause a problem in this explicit 
integration scheme. The second term in equation ([Z7|) tries to 
prevent particle inter-penetration. If a particle happens to get 
relatively close to another SPH particle in the time At/2 and 
the relative velocity of the approach is large, this term can 
suddenly lead to a very large repulsive acceleration a vlsc , try- 
ing to prevent the particles from getting any closer. However, 
it is then too late to reduce the timestep. Instead, the veloc- 
ity of the approaching particle will be changed by a vlsc At, 
possibly reversing the approach of the two particles. But the 
artificial viscosity should at most halt the approach of the par- 
ticles. To guarantee this, we introduce an upper cut-off to the 
maximum acceleration induced by the artificial viscosity. If 
Vij ■ Yij < 0, we replace equation (E6b with 



n, 



\{fi + fj) min 



n, 



(rrii 



WijAt 



(50) 



where W i3 = v lj V l [W(rij;h t ) + W{v i:j ;hj)} /2. With this 
change, the integration scheme still works reasonably well 
in regimes with strong shocks under conditions of relatively 
coarse timestepping. Of course, a small enough value of the 
Courant factor will prevent this situation from occurring to 
begin with. 

Since we use the gravitational tree of the SPH particles for 
the neighbour search, another subtlety arises in the context 
of dynamic tree updates, where the full tree is not necessar- 
ily reconstructed every single timestep. The range searching 
technique relies on the current values of the maximum SPH 
smoothing length in each node, and also expects that all par- 
ticles of a node are still inside the boundaries set by the side- 
length of a node. To guarantee that the neighbour search will 
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always give correct results, we perform a special update of the 
SPH-tree every timestep. It involves a loop over every SPH 
particle that checks whether the particle's smoothing length 
is larger than h max stored in its parent node, or if it falls out- 
side the extension of the parent node. If either of these is 
the case, the properties of the parent node are updated ac- 
cordingly, and the tree is further followed 'backwards' along 
the parent nodes, until each node is again fully 'contained' 
in its parent node. While this update routine is very fast in 
general, it does add some overhead, proportional to the num- 
ber of SPH particles, and thus breaks in principle the ideal 
scaling (proportional to M) obtained for purely collisionless 
simulations. 

5.5. Implementation of cooling 

When radiative cooling is included in simulations of galaxy 
formation or galaxy interaction, additional numerical prob- 
lems arise. In regions of strong gas cooling, the cooling times 
can become so short that extremely small timesteps would be 
required to follow the internal energy accurately with the sim- 
ple explicit integration scheme used so far. 

To remedy this problem, we treat the cooling semi- 
implicitly in an isochoric approximation. At any given 
timestep, we first compute the rate zi ad of change of the in- 
ternal energy due to the ordinary adiabatic gas physics. In an 
isochoric approximation, we then solve implicitly for a new 
internal energy predicted at the end of the timestep, i.e. 



-fn+l) (n) 

u) — u) + u 



ad Ai 



A 



In) ~(n+l) 
Pi > U 



At 



P 



(n) 



(51) 



The implicit computation of the cooling rate guarantees sta- 
bility. Based on this estimate, we compute an effective rate 
of change of the internal energy, which we then take as 



(ri+l) (n) / A , 

u) — u) /At . 



(52) 



We use this last step because the integration scheme requires 
the possibility to predict the internal energy at arbitrary times. 
With the above procedure, m is always a continuous function 
of time, and the prediction of m may be done for times in be- 
tween the application of the isochoric cooling/heating. Still, 
there can be a problem with predicted internal energies in 
cases when the cooling time is very small. Then a particle can 
lose a large fraction of its energy in a single timestep. While 
the implicit solution will still give a correct result for the tem- 
perature at the end of the timestep, the predicted energy in the 
middle of the next timestep could then become very small or 
even negative because of the large negative value of u. We 
therefore restrict the maximum cooling rate such that a parti- 
cle is only allowed to lose at most half its internal energy in a 
given timestep, preventing the predicted energies from 'over- 
shooting'. Katz & Gunn (1991 ) have used a similar method 
to damp the cooling rate. 



5.6. Integration in comoving coordinates 

For simulations in a cosmological context, the expansion of 
the universe has to be taken into account. Let x denote co- 
moving coordinates, and a be the dimensionless scale factor 
(a = 1.0 at the present epoch). Then the Newtonian equation 
of motion becomes 



x + 2-x 

a 



-G 



^(x'Hx-x')^ 



<-'|3 



(53) 



Here the function dp(x) = p(x) — p denotes the (proper) 
density fluctuation field. 

In an N-body simulation with periodic boundary condi- 
tions, the volume integral of equation (^3j) is carried out over 
all space. As a consequence, the homogeneous contribution 
arising from p drops out around every point. Then the equa- 
tion of motion of particle i becomes 



2-x, 

a 
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^3 
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(54) 



where the summation includes all periodic images of the par- 
ticles j. 

However, one may also employ vacuum boundary condi- 
tions if one simulates a spherical region of radius R around 
the origin, and neglects density fluctuations outside this re- 
gion. In this case, the background density p gives rise to an 
additional term, viz. 



> • 1 

2-Xi = -r 



&i 1 yl 



(55) 



GADGET supports both periodic and vacuum boundary con- 
ditions. We implement the former by means of the Ewald 



summation technique (Hernquist et al. 1991 ) 



For this purpose, we modify the tree walk such that each 
node is mapped to the position of its nearest periodic image 
with respect to the coordinate under consideration. If the mul- 
tipole expansion of the node can be used according to the cell 
opening criterion, its partial force is computed in the usual 
way. However, we also need to add the force exerted by all 
the other periodic images of the node. The slowly converg- 
ing sum over these contributions can be evaluated with the 
Ewald technique. If x is the coordinate of the point of force- 
evaluation relative to a node of mass M, the resulting addi- 
tional acceleration is given by 



erfc(a|x — nL\) 




y- 



12 IVJ2 



exp 
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(56) 
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Here n and h are integer triplets, L is the box size, and a is an 
arbitrary number ( |Hernquist et al. 1991 ). Good convergence 
is achieved for a = 2/L, where we sum over the range |n| < 
5 and |h| < 5. Similarly, the additional potential due to the 
periodic replications of the node is given by 



c (x) = Mi — h 



a 2 L 3 



y-v erfc(a|x — nL|) 



L ^ Trlhl 2 
h^o 1 1 



exp 



2 L 2 




(57) 



We follow Hernquist et al. (1991) and tabulate the correc- 
tion fields a c (x) /M and c (x) /M for one octant of the sim- 
ulation box, and obtain the result of the Ewald summation 
during the tree walk from trilinear interpolation off this grid. 
It should be noted however, that periodic boundaries have a 
strong impact on the speed of the tree algorithm. The number 
of floating point operations required to interpolate the correc- 
tion forces from the grid has a significant toll on the raw force 
speed and can slow it down by almost a factor of two. 

In linear theory, it can be shown that the kinetic energy 



T 



(58) 



in peculiar motion grows proportional to a, at least at early 
times. This implies that m^x 2 oc 1/a, hence the co- 
moving velocities x = v/a actually diverge for a — ► 0. 
Since cosmological simulations are usually started at redshift 
z ~ 30 — 100, one therefore needs to follow a rapid deceler- 
ation of x at high redshift. So it is numerically unfavourable 
to solve the equations of motion in the variable x. 

To remedy this problem, we use an alternative velocity 
variable 

weijIx, (59) 

and we employ the expansion factor itself as time variable. 
Then the equations of motion become 



dw 

da 



3 w 
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a 2 S(a) 



dx w 

da S(a) ' 

with S(a) = a^H(a) given by 

5(a) = H Q ^tt Q + a(l - fl - f2 A ) + a 3 A . 



(60) 
(61) 



(62) 



Note that for periodic boundaries the second term in the 
square bracket of equation ( |(50[ ) is absent, instead the sum- 
mation extends over all periodic images of the particles. 

Using the Zel'dovich approximation, one sees that w re- 
mains constant in the linear regime. Strictly speaking this 



holds at all times only for an Einstein-de-Sitter universe, 
however, it is also true for other cosmologies at early times. 
Hence equations (^9|) to ( |62"| ) in principle solve linear theory 
for arbitrarily large steps in a. This allows to traverse the lin- 
ear regime with maximum computational efficiency. Further- 
more, equations ( p9| ) to (^2|) represent a convenient formula- 
tion for general cosmologies, and for our variable timestep in- 
tegrator. Since w does not vary in the linear regime, predicted 
particle positions based on x,j = Xj + w.j(a p — a,)/iS(a p ) are 
quite accurate. Also, the acceleration entering the timestep 
criterion may now be identified with dw / da, and the timestep 
becomes 

dw 

Aa = a ta \(J 



da 



(63) 



The above equations only treat the gravity part of the dy- 
namical equations. However, it is straightforward to express 
the hydrodynamical equations in the variables (x, w, a) as 
well. For gas particles, equation ( |60| ) receives an additional 
contribution due to hydrodynamical forces, viz. 



dw\ 



1 V X P 



da J hydro aS ( a ) 

For the energy equation, one obtains 

du _ 3P 1 P 
da a p S(a) p 



(64) 



(65) 



Here the first term on the right hand side describes the adia- 
batic cooling of gas due to the expansion of the universe. 

6. Parallelization 

Massively parallel computer systems with distributed mem- 
ory have become increasingly popular recently. They can 
be thought of as a collection of workstations, connected by 
a fast communication network. This architecture promises 
large scalability for reasonable cost. Current state-of-the art 
machines of this type include the Cray T3E and IBM SP/2. 
It is an interesting development that 'Beowulf -type systems 
based on commodity hardware have started to offer floating 
point performance comparable to these supercomputers, but 
at a much lower price. 

However, an efficient use of parallel distributed memory 
machines often requires substantial changes of existing algo- 
rithms, or the development of completely new ones. Concep- 
tually, parallel programming involves two major difficulties 
in addition to the task of solving the numerical problem in 
a serial code. First, there is the difficulty of how to divide 
the work and data evenly among the processors, and second, 
an efficient communication scheme between the processors 
needs to be devised. 

In recent years, a number of groups have developed par- 
allel N-body codes, all of them with different paralleliza- 
tion strategies, and different strengths and weaknesses. Early 
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Figure 2: Schematic representation of the domain decomposition in two dimensions, and for four processors. Here, the first split occurs along 
the y-axis, separating the processors into two groups. They then independently carry out a second split along the x-axis. After completion of 
the domain decomposition, each processor element (PE) can construct its own BH tree just for the particles in its part of the computational 
domain. 



versions of parallel codes include those of Barnes (1986) 
Makino & Hut (1989| ) and [Theuns & Rathsack (1993[ hLatei- 
Warren et al. (1992) parallelized the BH-tree code on mas- 



sivel y parallel machines with distributed memory. D ubinski 
(1996) presented the first parallel tree code based on MPI. 
frjikaiakos & Stadel (1996 ) have developed a parallel simu- 
lation code (PKDGRAV) that works with a balanced binary 
tree. Mor e recently, parallel tree- SPH codes have been intro- 
duced by [Dave et al' (1997| ) and pa & Carraro (2000| ), and 
a PVM impl ementation of a gravity-only tree code has been 
described by Viturro & Carpintero (2000). 



We here report on our newly developed parallel version of 
GADGET, where we use a parallelization strategy that differs 
from previous workers. It also implements individual particle 
timesteps for the first time on distributed-memory, massively 
parallel computers. We have used the Message Passing Inter- 
face (MPI) ( [Snir et al. 1995| ; ^acheco 1997] ), which is an ex- 
plicit communication scheme, i.e. it is entirely up to the user 
to control the communication. Messages containing data can 
be sent between processors, both in synchronous and asyn- 
chronous modes. A particular advantage of MPI is its flexi- 
bility and portability. Our simulation code uses only standard 
C and standard MPI, and should therefore run on a variety of 
platforms. We have confirmed this so far on Cray T3E and 
IBM SP/2 systems, and on Linux-PC clusters. 



6.1. Domain decomposition 

The typical size of problems attacked on parallel computers 
is usually much too large to fit into the memory of individ- 
ual computational nodes, or into ordinary workstations. This 
fact alone (but of course also the desire to distribute the work 
among the processors) requires a partitioning of the problem 
onto the individual processors. 

For our N-body/SPH code we have implemented a spatial 
domain decomposition, using the orthog onal recursive bisec- 
tion (ORB) algorithm ( Dubinski 1996 ). In the first step, a 
split is found along one spatial direction, e.g. the x-axis, and 
the collection of processors is grouped into two halves, one 
for each side of the split. These processors then exchange 
particles such that they end up hosting only particles lying 
on their side of the split. In the simplest possible approach, 
the position of the split is chosen such that there are an equal 
number of particles on both sides. However, for an efficient 
simulation code the split should try to balance the work done 
in the force computation on the two sides. This aspect will be 
discussed further below. 

In a second step, each group of processors finds a new split 
along a different spatial axis, e.g. the y-axis. This splitting 
process is repeated recursively until the final groups consist 
of just one processor, which then hosts a rectangular piece 
of the computational volume. Note that this algorithm con- 
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Figure 3: Schematic illustration of the parallelization scheme of GADGET for the force computation. In the first step, each PE identifies 
the active particles, and puts their coordinates in a communication buffer. In a communication phase, a single and identical list of all these 
coordinates is then established on all processors. Then each PE walks its local tree for this list, thereby obtaining a list of partial forces. These 
are then communicated in a collective process back to the original PE that hosts the corresponding particle coordinate. Each processor then 
sums up the incoming force contributions, and finally arrives at the required total forces for its active particles. 



strains the number of processors that may be used to a power 
of two. Other algorithms for the domain decomposition, for 



example Voronoi tessellations (Yahagi et al. 1999), are free 
of this restriction. 

A two-dimensional schematic illustration of the ORB is 
shown in Figure ||. Each processor can construct a local BH 
tree for its domain, and this tree may be used to compute 
the force exerted by the processors' particles on arbitrary test 
particles in space. 

6.2. Parallel computation of the gravitational force 

GADGET'S a lgorithm for par allel force computation differs 
from that of Dubinski (199(j ), who introduced the notion of 
locally essential trees. These are trees that are sufficiently 
detailed to allow the full force computation for any parti- 
cle local to a processor, without further need for information 
from other processors. The locally essential trees can be con- 
structed from the local trees by pruning and exporting parts 
of these trees to other processors, and attaching these parts as 
new branches to the local trees. To determine which parts of 
the trees need to be exported, special tree walks are required. 

A difficulty with this technique occurs in the context of 
dynamic tree updates. While the additional time required to 
promote local trees to locally essential trees should not be 
an issue for an integration scheme with a global timestep, 
it can become a significant source of overhead in individual 



timestep schemes. Here, often only one per cent or less of 
all particles require a force update at one of the (small) sys- 
tem timesteps. Even if a dynamic tree update scheme is used 
to avoid having to reconstruct the full tree every timestep, 
the locally essential trees are still confronted with subtle syn- 
chronization issues for the nodes and particles that have been 
imported from other processor domains. Imported particles 
in particular may have received force computations since the 
last 'full' reconstruction of the locally essential tree occurred, 
and hence need to be re-imported. The local domain will 
also lack sufficient information to be able to update imported 
nodes on its own if this is needed. So some additional com- 
munication needs to occur to properly synchronize the locally 
essential trees on each timestep. 'On-demand' schemes, in- 
volving asynchronous communication, may be the best way 
to accomplish this in practice, but they will still add some 
overhead and are probably quite complicated to implement. 
Also note that the construction of locally essential trees de- 
pends on the opening criterion. If the latter is not purely ge- 
ometric but depends on the particle for which the force is 
desired, it can be difficult to generate a fully sufficient lo- 
cally essential tree. For these reasons we chose a different 
parallelization scheme that scales linearly with the number of 
particles that need a force computation. 

Our strategy starts from the observation that each of the 
local processor trees is able to provide the force exerted by 
its particles for any location in space. The full force might 
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thus be obtained by adding up all the partial forces from the 
local trees. As long as the number of these trees is less than 
the number of typical particle-node interactions, this compu- 
tational scheme is practically not more expensive than a tree 
walk of the corresponding locally essential tree. 

A force computation therefore requires a communication 
of the desired coordinates to all processors. These then walk 
their local trees, and send partial forces back to the original 
processor that sent out the corresponding coordinate. The to- 
tal force is then obtained by summing up the incoming con- 
tributions. 

In practice, a force computation for a single particle would 
be badly imbalanced in work in such a scheme, since some 
of the processors could stop their tree walk already at the 
root node, while others would have to evaluate several hun- 
dred particle-node interactions. However, the time integra- 
tion scheme advances at a given timestep always a group of 
M particles of size about 0.5-5 per cent of the total num- 
ber of particles. This group represents a representative mix 
of the various clustering states of matter in the simulation. 
Each processor contributes some of its particle positions to 
this mix, but the total list of coordinates is the same for all 
processors. If the domain decomposition is done well, one 
can arrange that the cumulative time to walk the local tree 
for all coordinates in this list is the same for all processors, 
resulting in good work-load balance. In the time integration 
scheme outlined above, the size M of the group of active 
particles is always roughly the same from step to step, and 
it also represents always the same statistical mix of particles 
and work requirements. This means that the same domain 
decomposition is appropriate for each of a series of consec- 
utive steps. On the other hand, in a block step scheme with 
binary hierarchy, a step where all particles are synchronized 
may be followed by a step where only a very small fraction 
of particles are active. In general, one cannot expect that the 
same domain decomposition will balance the work for both 
of these steps. 

Our force computation scheme proceeds therefore as 
sketched schematically in Figure |3| Each processor identifies 
the particles that are to be advanced in the current timestep, 
and puts their coordinates in a communication buffer. Next, 
an all-to-all communication phase is used to establish the 
same list of coordinates on all processors. This communi- 
cation is done in a collective fashion: For N p processors, 
the communication involves N p — 1 cycles. In each cy- 
cle, the processors are arranged in N p /2 pairs. Each pair 
exchanges their original list of active coordinates. While 
the amount of data that needs to be communicated scales as 
0[M(N p - 1)] ~ 0(MN p ), the wall-clock time required 
scales only as 0(M + cN p ) because the communication is 
done fully in parallel. The term c N p describes losses due to 
message latency and overhead due to the message envelopes. 
In practice, additional losses can occur on certain network 
topologies due to message collisions, or if the particle num- 



bers contributed to M by the individual processors are signif- 
icantly different, resulting in communication imbalance. On 
the T3E, the communication bandwidth is large enough that 
only a very small fraction of the overall simulation time is 
spent in this phase, even if processor partitions as large as 
512 are used. On Beowulf-class networks of workstations, 
we find that typically less than about 10-20% of the time is 
lost due to communication overhead if the computers are con- 
nected by a switched network with a speed of 100 Mbit s _1 . 

In the next step, all processors walk their local trees and 
replace the coordinates with the corresponding force contri- 
bution. Note that this is the most time-consuming step of 
a collisionless simulation (as it should be), hence work-load 
balancing is most crucial here. After that, the force contri- 
butions are communicated in a similar way as above between 
the processor pairs. The processor that hosted a particular co- 
ordinate adds up the incoming force contributions and finally 
ends up with the full force for that location. These forces 
can then be used to advance its locally active particles, and to 
determine new timesteps for them. In these phases of the N- 
body algorithm, as well as in the tree construction, no further 
information from other processors is required. 

6.3. Work-load balancing 

Due to the high communication bandwidth of parallel super- 
computers like the T3E or the SP/2, the time required for 
force computation is dominated by the tree walks, and this 
is also the dominating part of the simulation as a whole. It 
is therefore important that this part of the computation paral- 
lelizes well. In the context of our parallelization scheme, this 
means that the domain decomposition should be done such 
that the time spent in the tree walks of each step is the same 
for all processors. 

It is helpful to note, that the list of coordinates for the tree 
walks is independent of the domain decomposition. We can 
then think of each patch of space, represented by its particles, 
to cause some cost in the tree-walk process. A good measure 
for this cost is the number of particle-node interactions orig- 
inating from this region of space. To balance the work, the 
domain decomposition should therefore try to make this cost 
equal on the two sides of each domain split. 

In practice, we try to reach this goal by letting each tree- 
node carry a counter for the number of node-particle inter- 
actions it participated in since the last domain decomposi- 
tion occurred. Before a new domain decomposition starts, 
we then assign this cost to individual particles in order to ob- 
tain a weight factor reflecting the cost they on average incur 
in the gravity computation. For this purpose, we walk the tree 
backwards from a leaf (i.e. a single particle) to the root node. 
In this walk, the particle collects its total cost by adding up 
its share of the cost from all its parent nodes. The computa- 
tion of these cost factors differs somewhat from the method 
of pubinski (1996 ), but the general idea of such a work-load 
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balancing scheme is similar. 

Note that an optimum work-load balance can often result in 
substantial memory imbalance. Tree-codes consume plenty 
of memory, so that the feasible problem size can become 
memory rather than CPU-time limited. For example, a sin- 
gle node with 128 Mbyte on the Garching T3E is already 
filled to roughly 65 per cent with 4.0xl0 5 particles by the 
present code, including all memory for the tree structures, and 
the time integration scheme. In this example, the remaining 
free memory can already be insufficient to allow an optimum 
work-load balance in strongly clustered simulations. Unfor- 
tunately, such a situation is not untypical in practice, since 
one usually strives for large N in N-body work, so one is al- 
ways tempted to fill up most of the available memory with 
particles, without leaving much room to balance the work- 
load. Of course, GADGET can only try to balance the work 
within the constraints set by the available memory. The code 
will also strictly observe a memory limit in all intermediate 
steps of the domain decomposition, because some machines 
(e.g. T3E) do not have virtual memory. 

6.4. Parallelization of SPH 

Hydrodynamics can be seen as a more natural candidate for 
parallelization than gravity, because it is a local interaction. 
In contrast to this, the gravitational N-body problem has the 
unpleasant property that at all times each particle interacts 
with every other particle, making gravity intrinsically diffi- 
cult to parallelize on distributed memory machines. 

It therefore comes as no large surprise that the paralleliza- 
tion of SPH can be handled by invoking the same strategy we 
employed for the gravitational part, with only minor adjust- 
ments that take into account that most SPH particles (those 
entirely 'inside' a local domain) do not rely on information 
from other processors. 

In particular, as in the purely gravitational case, we do not 
import neighbouring particles from other processors. Rather, 
we export the particle coordinates (and other information, if 
needed) to other processors, which then deliver partial hy- 
drodynamical forces or partial densities contributed by their 
particle population. For example, the density computation 
of SPH can be handled in much the same way as that of the 
gravitational forces. In a collective communication, the pro- 
cessors establish a common list of all coordinates of active 
particles together with their smoothing lengths. Each proces- 
sor then computes partial densities by finding those of its par- 
ticles that lie within each smoothing region, and by doing the 
usual kernel weighting for them. These partial densities are 
then delivered back to the processor that holds the particle 
corresponding to a certain entry in the list of active coordi- 
nates. This processor then adds up the partial contributions, 
and obtains the full density estimate for the active particle. 

The locality of the SPH interactions brings an important 
simplification to this scheme. If the smoothing region of a 



particle lies entirely inside a local domain, the particle co- 
ordinate does not have to be exported at all, allowing a large 
reduction in the length of the common list of coordinates each 
processor has to work on, and reducing the necessary amount 
of communication involved by a large factor. 

In the first phase of the SPH computation, we find in this 
way the density and the total number of neighbours inside the 
smoothing length, and we evaluate velocity divergence and 
curl. In the second phase, the hydrodynamical forces are then 
evaluated in an analogous fashion. 

Notice that the computational cost and the communication 
overhead of this scheme again scale linearly with the number 
M of active particles, a feature that is essential for good adap- 
tivity of the code. We also note that in the second phase of 
the SPH computation, the particles 'see' the updated density 
values automatically. When SPH particles themselves are im- 
ported to form a locally essential SPH-domain, they have to 
be exchanged a second time if they are active particles them- 
selves in order to have correctly updat ed density values for 
the hydrodynamical force computation ( Dave et al. 1997 ). 

An interesting complication arises when the domain de- 
composition is not repeated every timestep. In practice this 
means that the boundaries of a domain may become 'diffuse' 
when some particles start to move out of the boundaries of 
the original domain. If the spatial region that we classified 
as interior of a domain is not adjusted, we then risk missing 
interactions because we might not export a particle that starts 
to interact with a particle that has entered the local domain 
boundaries, but is still hosted by another processor. Note 
that our method to update the neighbour-tree on each sin- 
gle timestep already guarantees that the correct neighbour- 
ing particles are always found on a local processor - if such 
a search is conducted at all. So in the case of soft domain 
boundaries we only need to make sure that all those particles 
are really exported that can possibly interact with particles on 
other processors. 

We achieve this by 'shrinking' the interior of the local do- 
main. During the neighbour-tree update, we find the maxi- 
mum spatial extension of all SPH particles on the local do- 
main. These rectangular regions are then gathered from all 
other processors, and cut with the local extension. If an over- 
lap occurs the local 'interior' is reduced accordingly, thereby 
resulting in a region that is guaranteed to be free of any parti- 
cle lying on another processor. In this way, the SPH algorithm 
will produce correct results, even if the domain decomposi- 
tion is repeated only rarely, or never again. In the case of 
periodic boundaries, special care has to be taken in treating 
cuts of the current interior with periodic translations of the 
other domain extensions. 



7. Results and tests 
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Figure 4: NFW density profiles used in the test of time integration 
criteria. The solid line gives the shape of the normal (unsoftened) 
NFW profile, dashed lines are softened according to e = 0.0004, 
0.004, and 0.04, respectively (see equation H). 



7.1 . Tests of timestep criteria 

Orbit integration in the collisionless systems of cosmologi- 
cal simulations is much less challenging than in highly non- 
linear systems like star clusters. As a consequence, all the 
simple timestep criteria discussed in Section 5.1 are capable 
of producing accurate results for sufficiently small settings 
of their tolerance parameters. However, some of these cri- 
teria may require a substantially higher computational effort 
to achieve a desired level of integration accuracy than oth- 
ers, and the systematic deviations under conditions of coarse 
timestepping may be more severe for certain criteria than for 
others. 

In order to investigate this issue, we have followed the 
orbits of the partic l e dist ribution of a typical NFW-halo 
( Navarro et al. 1996 , 1997 ) for one Hubble time. The initial 
radial particle distribution and the initial velocity distribution 
were taken from a halo of circular velocity 1350 kms -1 iden- 
tified in a cosmological N-body simulation and well fit by the 
NFW profile. However, we now evolved the particle distri- 
bution using a fixed softened potential, corresponding to the 
mass profile 



p{r) = 



^cPcrit. 



(66) 



fitted to the initial particle distribution. We use a static poten- 
tial in order to be able to isolate properties of the time integra- 



tion only. As a side-effect this also allows a quick integration 
of all orbits for a Hubble time, and the analytic formulae for 
the density and potential can also be used to check energy 
conservation for each of the particles individually. Note that 
we introduced a softening parameter e in the profile which is 
not present in the normal NFW parameterization. The halo 
had concentration c = 8 with r s = 170/i -1 kpc, and con- 
tained about 140000 particles inside the virial radius. We 
followed the orbits of all particles inside a sphere of radius 
3000 /i _1 kpc around the halo center (about 200000 in total) 
using GADGET'S integration scheme for single particles in 
haloes of three different softening lengths, e = 0.0004, 0.004, 
and 0.04. The corresponding density profiles are plotted in 
Figure 0. 

In Figure we give the radii of shells enclosing fixed 
numbers of particles as a function of the mean number of 
force evaluations necessary to carry out the integration for 
one Hubble time. The Figure thus shows the convergence of 
these radii when more and more computational effort is spent. 

The results show several interesting trends. In general, 
for poor integration accuracy (on the left), all radii are much 
larger than the converged values, i.e. the density profile shows 
severe signs of relaxation even at large distances from the 
center. When an integrator with fixed timestep is used (solid 
lines), the radii decrease monotonically when more force 
evaluations, i.e. smaller timesteps, are taken, until conver- 
gence to asymptotic values is reached. The other timestep cri- 
teria converge to these asymptotic values much more quickly. 
Also, for small softening values, the timestep criterion At oc 
|a| _1 (dashed lines) performs noticeably better than the one 
with t oc |a| -1 / 2 (dotted lines), i.e. it is usually closer to the 
converged values at a given computational cost. On the other 
hand, the criterion based on the local dynamical time (dot- 
dashed lines) does clearly the best job. It stays close to the 
converged values even at a very low number of timesteps. 

This impression is corroborated when the softening length 
is increased. The timestep based on the local dynamical time 
performs better than the other two simple criteria, and the 
fixed timestep. Interestingly, the criteria based on the accel- 
eration develop a non-monotonic behaviour when the poten- 
tial is softened. Already at e = 0.004 this is visible at the 
three lowest radii, but more clearly so for e = 0.04. Be- 
fore the radii increase when poorer integration settings are 
chosen, they actually first decrease. There is thus a regime 
where both of these criteria can lead to a net loss of energy 
for particles close to the shallow core induced by the soft- 
ening, thereby steepening it. We think that this is related to 
the fact that the timesteps actually increase towards the cen- 
ter for the criteria ([37|) and ([38]) as soon as the logarithmic 
slope of the density profile becomes shallower than -1. On 
the other hand, the density increases monotonically towards 
the center, and hence a timestep based the local dynamical 
time will decrease. If one uses the local velocity dispersion 
in the criterion (p7|), a better behaviour can be obtained, be- 
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Figure 5: Radii enclosing a fixed mass (7.1 x 1(T 5 , 3.6 x 10~ 4 , 1.8 x 10" 3 , 7.1 x 10" 3 , 3.6 x 10~ 2 , 2.1 x 10" 1 , and 7.1 x 10" 1 in units of 
the virial mass) after integrating the particle population of a NFW halo in a fixed potential for one Hubble time with various timestep criteria. 
Solid lines are for fixed timesteps for all particles, dashed lines for timesteps based on At cx |a| _1 , dotted lines for At oc |a|~ 1//2 , and 
dot-dashed for timesteps based on the local dynamical time, i.e. At oc p~ 1//2 . The three panels are for softened forms of the NFW potential 
according to equation (|66|). Note that the fluctuations of the lowest radial shell are caused by the low sampling of the halo in the core and are 
not significant. The horizontal axis gives the mean number of force evaluation per particle required to advance the particle set for one Hubble 
time. 



cause the velocity dispersion declines towards the center in 
the core region of dark matter haloes. However, this is not 
enough to completely suppress non-monotonic behaviour, as 
we have found in further sets of test calculations. 

We thus think that in particular for the cores of large haloes 
the local dynamical time provides the best of the timestep cri- 
teria tested. It leads to hardly any secular evolution of the 
density profile down to rather co arse timestepping. On the 
other hand, as Quinn et al. (1997 ) have pointed out, estimat- 
ing the density in a real simulation poses additional problems. 
Since kernel estimates are carried out over a number of neigh- 
bours, the density in small haloes can be underestimated, re- 
sulting in 'evaporation' of haloes. These haloes are better 
treated with a criterion based on local acceleration, which is 
much more accurately known in this case. 

We thus think that a conservative and accurate way to 



choose timesteps in cosmological simulations is obtained by 
taking the minimum of ([37]) and (|3^). This requires a compu- 
tation of the local matter density and local velocity dispersion 
of collisionless material. Both of these quantities can be ob- 
tained for the dark matter just as it is done for the SPH parti- 
cles. Of course, this requires additional work in order to find 
neighbours and to do the appropriate summations and kernel 
estimates, which is, however, typically not more than ~ 30% 
of the cost the gravitational force computation. However, as 
Figure |] shows, this is in general worth investing, because 
the alternative timestep criteria require more timesteps, and 
hence larger computational cost, by a larger factor in order to 
achieve the same accuracy. 
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Figure 6: Cumulative distribution of the relative force error obtained as a function of the tolerance parameter for three opening criteria, and 
for two different particle distribution. The panels in the left column show results for the particle distribution in the initial conditions of a 32 3 
simulation at z = 25, while the panels on the right give the force errors for the evolved and clustered distribution at z = 0. 
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Figure 7: Computational efficiency of three different cell-opening criteria, parameterized by their tolerance parameters. The horizontal axes 
is proportional to the computational cost, and the vertical axes gives the 95% percentile of the cumulative distribution of relative force errors 
in a 32 3 cosmological simulation. The left panel shows results for the initial conditions at z — 25, the right panel for the clustered distribution 
at redshift z = 0. 



7.2. Force accuracy and opening criterion 



In Section 3.3, we have outlined that standard values of the 
BH-opening criterion can result in very high force errors for 
the initial conditions of cosmological N-body simulations. 
Here, the density field is very close to being homogeneous, so 
that small peculiar accelerations arise from a cancellation of 
relatively large partial forces. We now investigate this prob- 
lem further using a cosmological simulation with 32 3 parti- 
cles in a periodic box. We consider the particle distribution of 
the initial conditions at redshift z = 25, and the clustered one 
of the evolved simulation at z = 0. Romeel Dave has kindly 
made these particle dumps available to us, together with exact 
forces he calculated for all particles with a direct summation 
technique, properly including the periodic images of the par- 
ticles. In the following we will compare his forces (which 
we take to be the exact result) to the ones computed by the 
parallel version of GADGET using two processors on a Linux 
PC. 

We first examine the distribution of the relative force er- 
ror as a function of the tolerance parameter used in the two 
different cell-opening criteria (^|) and (18). These results are 
shown in Figure [| where we also contrast these distributions 
with the ones obtained with an optimum cell-opening strat- 
egy. The latter may be operationally defined as follows: 

Any complete tree walk results in an interaction list that 
contains a number of internal tree nodes (whose multipole 
expansions are used) and a number of single particles (which 
give rise to exact partial forces). Obviously, the shortest pos- 
sible interaction list is that of just one entry, the root node of 



the tree itself. Suppose we start with this interaction list, and 
then open always the one node of the current interaction list 
that gives rise to the largest absolute force error. This is the 
node with the largest difference between multipole expansion 
and exact force, as resulting from direct summation over all 
the particles represented by the node. With such an opening 
strategy, the shortest possible interaction list can be found for 
any desired level of accuracy. The latter may be set by requir- 
ing that the largest force error of any node in the interaction 
list has fallen below a fraction 5|a| of the true force field. 

Of course, this optimum strategy is not practical in a real 
simulation, after all it requires the knowledge of all the ex- 
act forces for all internal nodes of the tree. If these were 
known, we wouldn't have to bother with trees to begin with. 
However, it is very interesting to find out what such a fidu- 
cial optimum opening strategy would ultimately deliver. Any 
other analytic or heuristic cell-opening criterion is bound to 
be worse. Knowing how much worse such criteria are will 
thus inform us about the maximum performance improve- 
ment possible by adopting alternative cell-opening criteria. 
We therefore computed for each particle the exact direct sum- 
mation forces exerted by each of the internal nodes, hence al- 
lowing us to perform an optimum tree walk in the above way. 
The resulting cumulative error distributions as a function of 5 
are shown in Figure [| 

Looking at this Figure, it is clear that the BH error distri- 
bution has a more pronounced tail of large force errors com- 
pared to the opening criterion (p"8|). What is most striking 
however is that at high redshift the BH force errors can be- 
come very large for values of the opening angle 9 that tend to 
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give perfectly acceptable accuracy for a clustered mass dis- 
tribution. This is clearly caused by its purely geometrical 
nature, which does not know about the smallness of the net 
forces, and thus does not invest sufficient effort to evaluate 
partial forces to high enough accuracy. 

The error distributions alone do not tell yet about the com- 
putational efficiency of the cell-opening strategies. To assess 
these, we define the error for a given cell-opening criterion 
as the 95% percentile of the error distribution, and we plot 
this versus the mean length of the interaction list per particle. 
This length is essentially linearly related to the computational 
cost of the force evaluation. 

In Figure ^, we compare the performance of the cell- 
opening criteria in this way. At high redshift, we see that the 
large errors of the BH criterion are mainly caused because the 
mean length of the interaction list remains small and does not 
adapt to the more demanding requirements of the force com- 
putation in this regime. Our 'new' cell-opening criterion does 
much better in this respect; its errors remain well controlled 
without having to adjust the tolerance parameter. 

Another advantage of the new criterion is that it is in fact 
more efficient than the BH criterion, i.e. it achieves higher 
force accuracy for a given computational expense. As Figure 
[7] shows, for a clustered particle distribution in a cosmolog- 
ical simulation, the implied saving can easily reach a factor 
2-3, speeding up the simulation by the same factor. Similar 
performance improvements are obtained for isolated galax- 
ies, as the example in Figure || demonstrates. This can be un- 
derstood as follows: The geometrical BH criterion does not 
take the dynamical significance of the mass distribution into 
account. For example, it will invest a large number of cell- 
particle interactions to compute the force exerted by a large 
void to a high relative accuracy, while actually this force is 
of small absolute size, and it would be better to concentrate 
more on those regions that provide most of the force on the 
current particle. The new opening criterion follows this latter 
strategy, improving the force accuracy at a given number of 
particle-cell interactions. 

We note that the improvement obtained by criterion ( |l8| ) 
brings us about halfway to what might ultimately be possible 
with an optimum criterion. It thus appears that there is still 
room for further improvement of the cell-opening criterion, 
even though it is clear that the optimum will likely not be 
reachable in practice. 

7.3. Colliding disk galaxies 

As a test of the performance and accuracy of the integration 
of collisionless systems, we here consider a pair of merging 
disk galaxies. Each galaxy has a massive dark halo consisting 
of 30000 particles, and an embedded stellar disk, represented 
by 20000 particles. The dark halo is modeled according to the 
NFW-profile, adiabatically modified by the central exponen- 
tial disk, which contributes 5% of the total mass. The halo 
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Figure 8: Force errors for an isolated halo/disk galaxy with the BH- 
criterion (boxes), and the new opening criterion (triangles). The dark 
halo of the galaxy is modeled with a NFW profile and is truncated at 
the virial radius. The plot shows the 90% percentile of the error dis- 
tribution, i.e. 90% of the particles have force errors below the cited 
values. The horizontal axes measures the computational expense. 



has a circular velocity W200 = 160 kms -1 , a concentration 
c = 5, and spin parameter A = 0.05. The radial exponen- 
tial scale length of the disk is = 4.5 /i _1 kpc, while the 
vertical structure is that of an isothermal sheet with thickness 
z = 0.2i?(j- The gravitational softening of the halo particles 
is set to 0.4 /i _1 kpc, and that of the disk to 0.1 /i _1 kpc. 

Initially, the two galaxies are set-up on a parabolic orbit, 
with separation such that their dark haloes just touch. Both 
of the galaxies have a prograde orientation, but are inclined 
with respect to the orbital plane. In fact, the test considered 
here corresponds exactly to the simulation 'CI' computed by 
|Springel (200C| ), where more information about the construc- 
tion of the ini tial conditions can be found (see also Springel 
& White 1999). 

We first consider a run of this model with a set of param- 
eters equal to the coarsest values we would typically employ 
for a simulation of this kind. For the time integration, we 
used the parameter a to \<T = 25 kms" 1 , and for the force 
computation with the tree algorithm, the new opening cri- 
terion with a = 0.04. The simulation was then run from 
t = to t = 2.8 in internal units of time (corresponding 
to 2.85 /i _1 Gyr). During this time the galaxies have their 
first close encounter at around t ~ 1.0, where tidal tails are 
ejected out of the stellar disks. Due to the braking by dynam- 
ical friction, the galaxies eventually fall together for a second 
time, after which they quickly merge and violently relax to 
form a single merger remnant. At t — 2.8, the inner parts of 
the merger remnant are well relaxed. 

This simulation required a total number of 4684 steps and 
27795733 force computations, i.e. a computationally equally 
expensive computation with fixed timesteps could just make 
280 full timesteps. The relative error in the total energy was 
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Figure 9: Spherically averaged density profile of the stellar compo- 
nent in the merger remnant of two colliding disk galaxies. Triangles 
show the results obtained using our variable timestep integration, 
while boxes and diamonds are for fixed timestep integrations with 
At = 0.01 (10 fc _1 Myr) and At = 0.0025 (2.5 ft _1 Myr), respec- 
tively. Note that the simulation using the adaptive timestep is about 
as expensive as the one with At = 0.01. In each case, the center 
of the remnant was defined as the position of the particle with the 
minimum gravitational potential. 



3.0 x 10~ 3 , and a Sun Ultrasparc-II workstation (sparcv9 pro- 
cessor, 296 MHz clock speed) did the simulation in 4 hours 
(using the serial code). The raw force speed was ^2800 force 
computations per second, older workstations will achieve 
somewhat smaller values, of course. Also, higher force ac- 
curacy settings will slow down the code somewhat. 

We now consider a simulation of the same system using a 
fixed timestep. For At = 0.01, the run needs 280 full steps, 
i.e. it consumes the same amount of CPU time as above. 
However, in this simulation, the error in the total energy is 
2.2%, substantially larger than before. There are also dif- 
ferences in the density profile of the merger remnants. In 
Figure |[ we compare the inner density profile of the simula- 
tion with adaptive timesteps (triangles) to the one with a fixed 
timestep of At = 0.01 (boxes). We here show the spherically 
averaged profile of the stellar component, with the center of 
the remnants defined as the position of the particle with the 
minimum gravitational potential. It can be seen that in the 
innermost ~ 1 /i _1 kpc, the density obtained with the fixed 
timestep falls short of the adaptive timestep integration. 

To get an idea how small the fixed timestep has to be to 
achieve similar accuracy as with the adaptive timestep, we 
have simulated this test a second time, with a fixed timesteps 
of At = 0.0025. We also show the corresponding profile 
(diamonds) in Figure ^. It can be seen that for smaller At, 
the agreement with the variable timestep result improves. At 
~ 2 x 0.4ft, _1 kpc, the softening of the dark matter starts 
to become important. For agreement down to this scale, the 



fixed timestep needs to be slightly smaller than At = 0.0025, 
so the overall saving due to the use of individual particle 
timesteps is at least a factor of 4 — 5 in this example. 

7.4. Collapse of a cold gas sphere 

The self-gravitating collapse of an initially isothermal, cool 
gas sphere has been a common test problem of SPH codes 
(jEvrard 1988|; [Hernquist & Katz 1989|; Steinmetz & Miiller 



1993; Thacker et al. 2000; Carrara et al. 1998,and others) 



Following these authors, we consider a spherically symmetric 
gas cloud of total mass M, radius R, and initial density profile 



p(r) 



M 1 



2ttR 2 



(67) 



We take the gas to be of constant temperature initially, with 
an internal energy per unit mass of 



u = 0.05 



GM 
~R~' 



(68) 



At the start of the simulation, the gas particles are at rest. We 
obtain their initial coordinates from a distorted regular grid 
that reproduces the density profile (|67|), and we use a system 
of units with G = M = R = 1. 

In Figure [l(| we show the evolution of the potential, the 
thermal, and the kinetic energy of the system from the start 
of the simulation at t = to t = 3. We plot results for two 
simulations, one with 30976 particles (solid), and one with 
4224 particles (dashed). During the central bounce between 
t w 0.8 and t w 1.2 most of the kinetic energy is converted 
into heat, and a strong shock wave travels outward. Later, the 
system slowly settles to virial equilibrium. 

For these runs N s was set to 40, the gravitational soften- 
ing to e = 0.02, and time integration was controlled by the 
parameters at \a = 0.05 and a CO ur = 0.1^, resulting in very 
good energy conservation. The absolute error in the total en- 
ergy is less than 1.1 x 10~ 3 at all times during the simula- 
tion, translating to a relative error of 0.23%. Since we use 
a time integration scheme with individual timesteps of arbi- 
trary size, this small error is reassuring. The total number of 
small steps taken by the 4224 particle simulation was 3855, 
with a total of 2192421 force computations, i.e. the equiva- 
lent number of 'full' timesteps was 519. A Sun Ultrasparc- 
II workstation needed 2300 seconds for the simulation. The 
larger 30976 particle run took 10668 steps, with an equiva- 
lent of 1086 'full' steps, and 12 hours of CPU time. Note 
that by reducing the time integration accuracy by a factor of 
2, with a corresponding reduction of the CPU time consump- 
tion by the same factor, the results do practically not change, 
however, the maximum error in the energy goes up to 1 .2% 
in this case. 



2 Note that our definition of the smoothing length h differs by a factor of 
2 from most previous SPH implementations. As a consequence, corre- 
sponding values of « C our are different by a factor of 2, too. 
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Figure 10: Time evolution of the thermal, kinetic, potential, and total energy for the collapse of an initially isothermal gas sphere. Solid lines 
show results for a simulations with 30976 particles, dashed lines are for a 4224 particle run. 



The results of Figure |lO| agree very well with those of 
|Steinmetz & Muller (1993fc and with [Thacker et al. (2000| ) 
if we compare to their 'bes t' implementation of artifici al vis- 
cosity (their version 12). |Steinmetz & Muller (1993b have 
also computed a solution of this problem with a very accurate, 
one-dimensional, spherically symmetric piecewise parabolic 
method (PPM). For particle numbers above 10000, our SPH 
results become very close to the finite difference result. How- 
ever, even for very small particle numbers, SPH is capable of 
reproducing the general features of the solution very well. 
We also expect that a three-dimensional PPM calculation of 
the collapse would likely require at least the same amount of 
CPU time as our SPH calculation. 



7.5. Performance and scalability of the parallel 
gravity 

We here show a simple test of the performance of the parallel 
version of the code under conditions relevant for real target 



applications. For this test, we have used a 'stripped-down' 
version of the initial conditions originally constructed for a 
high-resolution simulation of a cluster of galaxies. The orig- 
inal set of initial conditions was set-up to follow the cosmo- 
logical evolution of a large spherical region with comoving 
radius 70 /i _1 Mpc, within a ACDM cosmogony correspond- 
ing to fio = 0.3, S!a = 0.7, z st art = 50, and h = 0.7. In 
the center of the simulation sphere, 2 million high-resolution 
particles were placed in the somewhat enlarged Lagrangian 
region of the cluster. The rest of the volume was filled with 
an extended shell of boundary particles of larger mass and 
larger softening; they are needed for a proper representation 
of the gravitational tidal field. 

To keep our test simple, we have cut a centred sphere of co- 
moving radius 12.25 /i _1 Mpc from these initial conditions, 
and we only simulated the 500000 high-resolution particles 
with mass m p = 1.36 x 10 9 /i -1 Mq found within this region. 
Such a simulation will not be useful for direct scientific anal- 
ysis because it does not model the tidal field properly. How- 
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Table 1: Consumption of CPU-time in various parts of the code for a cosmological run from z = 50 to z = 4.3. The table gives timings for 
runs with 4, 8 and 16 processors on the Garching Cray T3E. The computation of the gravitational forces is by far the dominant computational 
task. We have further split up that time into the actual tree-walk time, the tree-construction time, the time for communication and summation of 
force contributions, and into the time lost by work-load imbalance. The potential computation is done only once in this test (it can optionally 
be done in regular intervals to check energy conservation of the code). 'Miscellaneous' refers to time spent in advancing and predicting 
particles, and in managing the binary tree for the timeline. I/O time for writing a snapshot file (groups of processors can write in parallel) is 
only 1-2 seconds, and therefore not listed here. 
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Figure 11: Code performance and scalability for a cosmological integration with vacuum boundaries. The left panel shows the speed of the 
gravitational force computation as a function of processor number (in particles per second). This is based on the tree walk time alone. In 
the simulation, additional time is needed for tree construction, work-load imbalance, communication, domain decomposition, prediction of 
particles, timeline, etc. This reduces the 'effective' speed of the code, as shown in the right panel. This effective speed gives the number of 
particles advanced by one timestep per second. Note that the only significant source of work-load imbalance in the code occurs in the gravity 
computation, where some small fraction of time is lost when processors idly wait for others to finish their tree-walks. 



ever, this test will show realistic clustering and time-stepping 
behaviour, and thus allows a reasonable assessment of the ex- 
pected computational cost and scaling of the full problem. 

We have run the test problem with GADGET on the Garch- 
ing T3E from redshift z = 50 to redshift z = 4.3. We 
repeated the identical run on partitions of size 4, 8, and 16 
processors. In this test, we included quadrupole moments in 
the tree computation, we used a BH opening criterion with 
6 = 1.0, and a gravitational softening length of 15 /i _1 kpc. 

In Table [l] we list in detail the elapsed wall-clock time for 



various parts of the code for the three simulations. The dom- 
inant sink of CPU time is the computation of gravitational 
forces for the particles. To advance the test simulation from 
z = 50 to z = 4.3, GADGET needed 30.0 x 10 6 force com- 
putations in a total of 3350 timesteps. Note that on average 
only 1.8% of the particles are advanced in a single timestep. 
Under these conditions it is challenging to eliminate sources 
of overhead incurred by the time-stepping and to maintain 
work-load balancing. GADGET solves this task satisfactorily. 
If we used a fixed timestep, the work-load balancing would be 



26 



GADGET: A code for collisionless and gasdynamical cosmological simulations 



essentially perfect. Note that the use of our adaptive timestep 
integrator results in a saving of about a factor of 3-5 compared 
to a fixed timestep scheme with the same accuracy. 

We think that the overall performance of GADGET is good 
in this test. The raw gravitational speed is high, and the algo- 
rithm used to parallelize the force computation scales well, as 
is seen in the left panel of Figure O. Note that the force-speed 
of the N p = 8 run is even higher than that of the N p = 4 run. 
This is because the domain decomposition does exactly one 
split in the x-, y-, and z-directions in the 7V P = 8 case. The 
domains are then close to cubes, which reduces the depth of 
the tree and speeds up the tree-walks. 

Also, the force communication does not involve a signifi- 
cant communication overhead, and the time spent in miscella- 
neous tasks of the simulation code scales closely with proces- 
sor number. Most losses in GADGET occur due to work-load 
imbalance in the force computation. While we think these 
losses are acceptable in the above test, one should keep in 
mind that we here kept the problem size fixed, and just in- 
creased the processor number. If we also scale up the prob- 
lem size, work-load balancing will be significantly easier to 
achieve, and the efficiency of GADGET will be nearly as high 
as for small processor number. 

Nevertheless, the computational speed of GADGET may 
seem disappointing when compared with the theoretical peak 
performance of modern microprocessors. For example, the 
processors of the T3E used for the timings have a nominal 
peak performance of 600 MFlops, but GADGET falls far short 
of reaching this. However, the peak performance can only be 
reached under the most favourable of circumstances, and typ- 
ical codes operate in a range where they are a factor of 5-10 
slower. GADGET is no exception here. While we think that 
the code does a reasonable job in avoiding unnecessary float- 
ing point operations, there is certainly room for further tuning 
measures, including processor-specific ones which haven't 
been tried at all so far. Also note that our algorithm of individ- 
ual tree walks produces essentially random access of memory 
locations, a situation that could hardly be worse for the cache 
pipeline of current microprocessors. 

7.6. Parallel SPH in a periodic volume 

As a further test of the scaling and performance of the paral- 
lel version of GADGET in typical cosmological applications 
we consider a simulation of structure formation in a periodic 
box of size (50/i _1 Mpc) 3 , including adiabatic gas physics. 
We use 32 3 dark matter particles, and 32 3 SPH particles in a 
ACDM cosmology with fi = 0.3, fl A = 0.7 and h = 0.7, 
normalized to erg = 0.9. For simplicity, initial conditions are 
constructed by displacing the particles from a grid, with the 
SPH particles placed on top of the dark matter particles. 

We have evolved these initial conditions from z = 10 to 
z = 1, using 2, 4, 8, 16, and 32 processors on the Cray T3E 
in Garching. The final particle distributions of all these runs 



8000 




Figure 12: Speed of the code in a gasdynamical simulation of a 
periodic ACDM universe, run from z = 10 to z = 1, as a func- 
tion of the number of T3E processors employed. The top panel 
shows the speed of the gravitational force computation (including 
tree construction times). We define the speed here as the number of 
force computations per elapsed wall-clock time. The middle panel 
gives the speed in the computation of hydrodynamical forces, while 
the bottom panel shows the resulting overall code speed in terms of 
particles advanced by one timestep per second. This effective code 
speed includes all other code overhead, which is less than 5% of the 
total cpu time in all runs. 



are in excellent agreement with each other. 

In the bottom panel of Figure [j~2|, we show the code speed 
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as a function of the number of processors employed. We here 
define the speed as the total number of force computations 
divided by the total elapsed wall-clock time during this test. 
Note that the scaling of the code is almost perfectly linear in 
this example, even better than for the set-up used in the pre- 
vious section. In fact, this is largely caused by the simpler 
geometry of the periodic box as compared to the spherical 
volume used earlier. The very absence of such boundary ef- 
fects makes the periodic box easier to domain-decompose, 
and to work-balance. 

The top and middle panels of Figure [l^ show the speed of 
the gravitational force computation and that of the SPH part 
separately. What we find encouraging is that the SPH algo- 
rithm scales really very well, which is promising for future 
large-scale applications. 

It is interesting that in this test (where N s = 40 SPH neigh- 
bours have been used) the hydrodynamical part actually con- 
sumes only ~ 25% of the overall CPU time. Partly this is 
due to the slower gravitational speed in this test compared to 
the results shown in Figure [H], which in turn is caused by the 
Ewald summation needed to treat the periodic boundary con- 
ditions, and by longer interaction lists in the present test (we 
here used our new opening criterion). Also note that only half 
of the particles in this simulation are SPH particles. 

We remark that the gravitational force computation will 
usually be more expensive at higher redshift than at lower 
redshift, while the SPH part does not have such a dependence. 
The fraction of time consumed by the SPH part thus tends to 
increase when the material becomes more clustered. In dissi- 
pative simulations one will typically form clumps of cold gas 
with very high density - objects that we think will form stars 
and turn into galaxies. Such cold knots of gas can slow down 
the computation substantially, because they require small hy- 
drodynamical timesteps, and if a lower spatial resolution cut- 
off for SPH is imposed, the hydrodynamical smoothing may 
start to involve more neighbours than N s . 

8. Discussion 

We have presented the numerical algorithms of our code 
GADGET, designed as a flexible tool to study a wide range 
of problems in cosmology. Typical applications of GADGET 
can include interacting and colliding galaxies, star formation 
and feedback in the interstellar medium, formation of clus- 
ters of galaxies, or the formation of large-scale structure in 
the universe. 

In fact, GADGET has already been used successfully in all 
of these areas. Using our code, Springel & White (1999) have 
studied the formation of tidal tails in colliding galaxies, and 
Springel (200C ) has modeled star formation and feedback in 
isolated and colliding gas-rich spirals. For these simulations, 
the serial version of the code was employed, both with and 
without support by the GRAPE special-purpose hardware. 



The parallel version of GADGET has been used to compute 
high-resolution N-body simulations of clusters of galaxies 
dSpringel 1999] |Springel et al. 2000j ; |Yoshida et al. 2000a| Jb|). 
In the largest simulation of this kind, 69 million particles have 
been employed, with 20 million of them ending up in the 
virialized region of a single object. The particle mass in the 
high-resolution zone was just ~ 10~ 10 of the total simulated 
mass, and the gravitational softening length was 0.7/i _1 kpc 
in a simulation volume of diameter 140 /i _1 Mpc, translating 
to an impressive spatial dynamic range of 2 x 10 5 in three 
dimensions. 

We have also successfully employed GADGET for two 
'constrained-realization' (CR) simulations of the Local Uni- 
verse (Mathis et al. 2000, in preparation). In these simu- 
lations, the observed density field as seen by IRAS galaxies 
has been used to constrain the phases of the waves of the ini- 
tial fluctuation spectrum. For each of the two CR simulations, 
we employed ~ 75 million particles in total, with 53 million 
high-resolution particles of mass 3.6 x lO 9 ft. _1 M (ACDM) 
or 1.2 x 10 w h~ 1 M (s (rCDM) in the low-density and critical- 
density models, respectively. 

The main technical features of GADGET are as follows. 
Gravitational forces are computed with a Barnes & Hut oct- 
tree, using multipole expansions up to quadrupole order. Pe- 
riodic boundary conditions can optionally be used and are im- 
plemented by means of Ewald summation. The cell-opening 
criterion may be chosen either as the standard BH-criterion, 
or a new criterion which we have shown to be computation- 
ally more efficient and better suited to cosmological simula- 
tions starting at high redshift. As an alternative to the tree- 
algorithm, the serial code can use the special-purpose hard- 
ware GRAPE both to compute gravitational forces and for the 
search for SPH neighbours. 

In our SPH implementation, the number of smoothing 
neighbors is kept exactly constant in the serial code, and 
is allowed to fluctuate in a small band in the parallel code. 
Force symmetry is achieved by using the kernel averaging 
technique, and a suitable neighbour searching algorithm is 
used to guarantee that all interacting pairs of SPH particles 
are always found. We use a shear-reduced artificial viscos- 
ity that has emerged as a good parameterization in recent 
system atic studies that compared several alternat ive formu- 
lations ( fThacker et al. 2000| ; |Eornbardi et al. 1999| ). 

Parallelization of the code for massively parallel super- 
computers is achieved in an explicit message passing ap- 
proach, using the MPI standard communication library. The 
simulation volume is spatially split using a recursive or- 
thogonal bisection, and each of the resulting domains is 
mapped onto one processor. Dynamic work-load balancing 
is achieved by measuring the computational expense incurred 
by each particle, and balancing the sum of these weights in 
the domain decomposition. 

The code allows fully adaptive, individual particle 
timesteps, both for collisionless particles and for SPH par- 
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tides. The speed-up obtained by the use of individual 
timesteps depends on the dynamic range of the time scales 
present in the problem, and on the relative population of these 
time scales with particles. For a collisionless cosmological 
simulation with a gravitational softening length larger than 
~ 30 /i _1 kpc the overall saving is typically a factor of 3 — 5. 
However, if smaller softening lengths are desired, the use of 
individual particle timesteps results in larger savings. In the 
hydrodynamical part, the savings can be still larger, espe- 
cially if dissipative physics is included. In this case, adap- 
tive timesteps may be required to make a simulation feasible 
to begin with. GADGET can be used to run simulations both 
in physical and in comoving coordinates. The latter is used 
for cosmological simulations only. Here, the code employs 
an integration scheme that can deal with arbitrary cosmolog- 
ical background models, and which is exact in linear theory, 
i.e. the linear regime can be traversed with maximum effi- 
ciency. 

GADGET is an intrinsically Lagrangian code. Both the 
gravity and the hydrodynamical parts impose no restriction 
on the geometry of the problem, nor any hard limit on the 
allowable dynamic range. Current and future simulations of 
structure formation that aim to resolve galaxies in their cor- 
rect cosmological setting will have to resolve length scales of 
size 0.1 — 1 /i _1 kpc in volumes of size ~ 100 /i~ 1 Mpc. This 
range of scales is accompanied by a similarly large dynamic 
range in mass and time scales. Our new code is essentially 
free to adapt to these scales naturally, and it invests computa- 
tional work only where it is needed. It is therefore a tool that 
should be well suited to work on these problems. 

Since GADGET is written in standard ANSI-C, and the par- 
allelization for massively parallel supercomputers is achieved 
with the standard MPI library, the code runs on a large vari- 
ety of platforms, without requiring any change. Having elim- 
inated the dependence on proprietary compiler software and 
operating systems we hope that the code will remain usable 
for the foreseeable future. We release the parallel and the se- 
rial version of GADGET publically in the hope that they will 
be useful for others as a scientific tool and as a basis for fur- 
ther numerical developments. 




Figure 13: Comparison of spline-softened (solid) and Plummer- 
softened (dotted) potential of a point mass with the Newtonian po- 
tential (dashed). Here h = 1.0, and e = ft/2.8. 
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for very insightful comments on the manuscript. 

Appendix: Softened tree nodes 

The smoothi ng kernel we use for SPH calc ulations is a spline 
of the form ( Monaghan & Lattanzio 1985 ) 



+6(1) , 0<£<i, 

0, jj>l. 

(69) 

Note that we define the smoothing kernel on the interval [0, h] 
and not on [0, 2h] as it is frequently done in other SPH calcu- 
lations. 

We derive the spline-softened gravitational force from this 
kernel by taking the force from a point mass m to be the one 
resulting from a density distribution p(r) = mW(r; h). This 
leads to a potential 
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The multipole expansion of a group of particles is discussed 
in Section (O. It results in a potential and force given by 
equations (|1 1|) and (0), respectively. The functions appear- 
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Figure 14: Comparison of spline-softened (solid) and Plummer- 
softened (dotted) force law with Newton's law (dashed). Here 
h = 1.0, and e = h/2.8. 
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Writing u — y/h, the explicit forms of these functions are 
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In Figures [0] and we show the spline-softened and 
Plummer-softened force and potential of a point mass. For 



a given spline softening length h, we define the 'equivalent' 
Plummer softening length as e = h/2.8. For this choice, the 
minimum of the potential at u = has the same depth. 
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